Remove unneeded errors from CRD #3

Tom-Newton · 2023-11-15T22:24:26Z

Tracking issue

Describe your changes

Check all the applicable boxes

I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.

Screenshots

Note to reviewers

hamersaw

Looks great! Seems like you went for a bit of a dive here!

flyteplugins/go/tasks/pluginmachinery/flytek8s/config/config.go

hamersaw · 2023-11-29T17:24:26Z

flytepropeller/pkg/apis/flyteworkflow/v1alpha1/node_status.go

 		}
-		if in.StartedAt == nil {
-			in.StartedAt = &n
-		}
-		if in.LastAttemptStartedAt == nil {
-			in.LastAttemptStartedAt = &n
-		}
-	}
-	in.LastUpdatedAt = &n
-
-	// For cases in which the node is either Succeeded or Skipped we clear most fields from the status
-	// except for StoppedAt and Phase. StoppedAt is used to calculate transition latency between this node and
-	// any downstream nodes and Phase is required for propeller to continue to downstream nodes.
-	if p == NodePhaseSucceeded || p == NodePhaseSkipped {
+		// Clear most status related fields after reaching a terminal state. This keeps the CRD state small to avoid 
+		// etcd size limits. Importantly we keep Phase, StoppedAt and Error which will be needed further. 
+		// Errors will still be needed but it will be cleaned up when possible because they can be very large.  


Is there a way we can gate this behind a flag? I know some users like having error messages and statuses persist in the CR when nodes fail.

There must be a way but I don't really know how. Do you have any suggestions?

I think one way would be to add a ClearStateOnTermination attribute to NodeStatus. If we set it correctly when creating the node, everything should work after that. The disadvantage is that we create a new parameter that needs to be stored in etcd. Personally, I think it would be preferable to pass an argument to UpdatePhase. That seems to require changing a commonly used interface, though, and I don't really know what the implications of that would be.
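To make the two options concrete, here is a minimal sketch. It uses a toy `NodeStatus`, not the real `v1alpha1` type. The method names `UpdatePhaseWithField` and `UpdatePhaseWithArg`, the `Message` and `Attempts` fields, and the set of fields being cleared are all assumptions for illustration only.

```go
package main

import "fmt"

// NodePhase is a simplified stand-in for the propeller phase enum.
type NodePhase int

const (
	NodePhaseRunning NodePhase = iota
	NodePhaseSucceeded
	NodePhaseFailed
)

// NodeStatus is a toy model of the node status stored in the CRD.
type NodeStatus struct {
	Phase    NodePhase
	Message  string // cleared on termination in this sketch
	Attempts int    // cleared on termination in this sketch
	Error    string // always kept in this sketch
	// Option A (hypothetical): a flag persisted with the status, so it costs etcd space.
	ClearStateOnTermination bool
}

func isTerminal(p NodePhase) bool {
	return p == NodePhaseSucceeded || p == NodePhaseFailed
}

// update records the new phase and optionally drops non-essential fields.
func (s *NodeStatus) update(p NodePhase, msg, errMsg string, clearOnTermination bool) {
	s.Phase, s.Message, s.Error = p, msg, errMsg
	if clearOnTermination && isTerminal(p) {
		s.Message = ""
		s.Attempts = 0
	}
}

// UpdatePhaseWithField is option A: the decision is read from the stored status.
func (s *NodeStatus) UpdatePhaseWithField(p NodePhase, msg, errMsg string) {
	s.update(p, msg, errMsg, s.ClearStateOnTermination)
}

// UpdatePhaseWithArg is option B: the caller passes the decision, nothing extra is stored.
func (s *NodeStatus) UpdatePhaseWithArg(p NodePhase, msg, errMsg string, clearOnTermination bool) {
	s.update(p, msg, errMsg, clearOnTermination)
}

func main() {
	a := &NodeStatus{Attempts: 2, ClearStateOnTermination: true}
	a.UpdatePhaseWithField(NodePhaseFailed, "task failed", "boom")
	fmt.Printf("A: message=%q attempts=%d error=%q\n", a.Message, a.Attempts, a.Error)

	b := &NodeStatus{Attempts: 2}
	b.UpdatePhaseWithArg(NodePhaseFailed, "task failed", "boom", false)
	fmt.Printf("B: message=%q attempts=%d error=%q\n", b.Message, b.Attempts, b.Error)
}
```

Option B keeps the stored object unchanged but widens the method signature, which is the interface change mentioned above.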

I opened a draft upstream PR, so it's probably best to discuss there: flyteorg#4596

hamersaw · 2023-11-29T17:26:17Z

flytepropeller/pkg/controller/nodes/executor.go

+		startedAt := nodeStatus.GetStartedAt()
+		if startedAt == nil {
+			startedAt = &t
+		}
+		nodeStatus.UpdatePhase(v1alpha1.NodePhaseFailed, t, nodeStatus.GetMessage(), nodeStatus.GetExecutionError())


This is just because we already delete the startedAt timestamp and we need it to observe the FailureDuration metric below, right? IIUC this will make the metric useless because we set startedAt to Now(). Is this correct?

hamersaw · 2023-11-29T17:27:38Z

flytepropeller/pkg/controller/nodes/executor.go

@@ -297,6 +297,7 @@ func (c *recursiveNodeExecutor) handleDownstream(ctx context.Context, execContex
 				// If the failure policy allows other nodes to continue running, do not exit the loop,
 				// Keep track of the last failed state in the loop since it'll be the one to return.
 				// TODO: If multiple nodes fail (which this mode allows), consolidate/summarize failure states in one.
+				stateOnComplete.ResetError()


This means we keep the error in the CR because it is necessary for reporting in admin events. Then once it's reported we delete it, right? My intuition says this should be behind the same configuration flag as deleting all the metadata for terminal nodes in failed states. Thoughts?
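A rough sketch of that "report, then clear, behind a flag" flow follows. The `Config` type, its `ClearErrorsAfterReporting` field, and the `reportAndMaybeClear` helper are hypothetical names for illustration; only `ResetError` appears in the diff, and its real signature may differ.

```go
package main

import "fmt"

// ExecutionError is a simplified stand-in for the error stored on a node.
type ExecutionError struct {
	Code    string
	Message string
}

// NodeStatus is a toy status holding only the error.
type NodeStatus struct {
	Error *ExecutionError
}

// ResetError drops the stored error so it no longer takes space in the CR.
func (s *NodeStatus) ResetError() { s.Error = nil }

// Config is a hypothetical configuration carrying the gating flag.
type Config struct {
	ClearErrorsAfterReporting bool
}

// reportAndMaybeClear hands a copy of the error to the reporter and then,
// only if the flag is set, removes it from the status.
func reportAndMaybeClear(cfg Config, s *NodeStatus, report func(ExecutionError)) {
	if s.Error == nil {
		return
	}
	report(*s.Error) // passed by value, so the reported copy survives the reset
	if cfg.ClearErrorsAfterReporting {
		s.ResetError()
	}
}

func main() {
	status := &NodeStatus{Error: &ExecutionError{Code: "UserError", Message: "boom"}}
	reportAndMaybeClear(Config{ClearErrorsAfterReporting: true}, status, func(e ExecutionError) {
		fmt.Println("reported:", e.Code, e.Message)
	})
	fmt.Println("error kept in status:", status.Error != nil)
}
```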

Tom-Newton · 2023-12-13T18:44:13Z

Upstream PR flyteorg#4596

Tom-Newton force-pushed the tomnewton/remove_error_messages_from_crd branch from a364764 to 15f46b0 Compare November 16, 2023 17:49

Tom-Newton changed the base branch from tomnewton/expreiment_with_complete_solution to tomnewton/collapse_sub_nodes_on_failures November 17, 2023 10:04

Tom-Newton changed the title ~~Tomnewton/remove error messages from crd~~ Remove unneeded errors from CRD Nov 17, 2023

Tom-Newton force-pushed the tomnewton/remove_error_messages_from_crd branch from 1e3b1b4 to e362fb3 Compare November 27, 2023 21:31

hamersaw reviewed Nov 29, 2023

Tom-Newton force-pushed the tomnewton/collapse_sub_nodes_on_failures branch from 3877f0c to c15e9a2 Compare November 29, 2023 20:01

Tom-Newton force-pushed the tomnewton/remove_error_messages_from_crd branch from e362fb3 to 8e5e002 Compare November 29, 2023 20:04

Tom-Newton force-pushed the tomnewton/collapse_sub_nodes_on_failures branch from c15e9a2 to 40c378f Compare November 29, 2023 20:05

Tom-Newton force-pushed the tomnewton/collapse_sub_nodes_on_failures branch from 40c378f to a4755d9 Compare December 13, 2023 14:21

Tom-Newton changed the base branch from tomnewton/collapse_sub_nodes_on_failures to master December 13, 2023 14:21

Tom-Newton added 19 commits December 13, 2023 14:23

ClearSubNodeStatus on failure

8f37882

More aggressive collapsing

789b65e

Tidy

a27e6f7

Fix panic

f3ce9bb

Tidy

92be6a1

Handle possibility of nil startedAt time

e6f245f

Update test assertions

8a2f03a

Don't track node errors

88d54d9

Wipe node error after its collected

dd03dee

Revert "Don't track node errors"

4cf31b6

This reverts commit dab428d.

Try clearing error message without breaking upstream propagation of errors

3c23c80

Fix clearing error message

ef42762

Create a copy of the error to return

f83cd35

Working error propagates to imdiate execution

6593612

Tidy

c4a5be0

Copy state when not failing immediately

52f90cc

Reset errors only when recording a new error

38ed55b

Tidy

337e308

White space

03e2e0c

Tom-Newton force-pushed the tomnewton/remove_error_messages_from_crd branch from 8e5e002 to 03e2e0c Compare December 13, 2023 14:32

Tom-Newton closed this Dec 13, 2023
