Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove unneeded errors from CRD #3

Closed
wants to merge 19 commits into from

Conversation

Tom-Newton
Copy link
Owner

Tracking issue

Describe your changes

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Screenshots

Note to reviewers

@Tom-Newton Tom-Newton force-pushed the tomnewton/remove_error_messages_from_crd branch from a364764 to 15f46b0 Compare November 16, 2023 17:49
@Tom-Newton Tom-Newton changed the base branch from tomnewton/expreiment_with_complete_solution to tomnewton/collapse_sub_nodes_on_failures November 17, 2023 10:04
@Tom-Newton Tom-Newton changed the title Tomnewton/remove error messages from crd Remove unneeded errors from CRD Nov 17, 2023
@Tom-Newton Tom-Newton force-pushed the tomnewton/remove_error_messages_from_crd branch from 1e3b1b4 to e362fb3 Compare November 27, 2023 21:31
Copy link

@hamersaw hamersaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Seems like you went for a bit of a dive here!

Comment on lines 631 to +634
}
if in.StartedAt == nil {
in.StartedAt = &n
}
if in.LastAttemptStartedAt == nil {
in.LastAttemptStartedAt = &n
}
}
in.LastUpdatedAt = &n

// For cases in which the node is either Succeeded or Skipped we clear most fields from the status
// except for StoppedAt and Phase. StoppedAt is used to calculate transition latency between this node and
// any downstream nodes and Phase is required for propeller to continue to downstream nodes.
if p == NodePhaseSucceeded || p == NodePhaseSkipped {
// Clear most status related fields after reaching a terminal state. This keeps the CRD state small to avoid
// etcd size limits. Importantly we keep Phase, StoppedAt and Error which will be needed further.
// Errors will still be needed but it will be cleaned up when possible because they can be very large.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way we can gate this behind a flag? I know some users like having error messages and status' persist in the CR when nodes fail.

Copy link
Owner Author

@Tom-Newton Tom-Newton Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There must be a way but I don't really know how. Do you have any suggestions?

I think one way would be to add a ClearStateOnTermination attribute to NodeStatus. Then if we set it correctly when creating the node everything should work after that. The disadvantage of that though is we create a new parameter that needs to be stored in etcd. Personally I think it would be preferable to pass in an argument to UpdatePhase. It seems like this would require a change to a commonly used interface though and I don't really know what the implications of that would be.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened a draft upstream PR so its probably best to discuss there flyteorg#4596

Comment on lines +1326 to +1334
startedAt := nodeStatus.GetStartedAt()
if startedAt == nil {
startedAt = &t
}
nodeStatus.UpdatePhase(v1alpha1.NodePhaseFailed, t, nodeStatus.GetMessage(), nodeStatus.GetExecutionError())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just because we already delete the startedAt timestamp and we need it to observe the FailureDuration metric below right? IIUC this will make the metric useless because we set startedAt to Now(), is this correct?

@@ -297,6 +297,7 @@ func (c *recursiveNodeExecutor) handleDownstream(ctx context.Context, execContex
// If the failure policy allows other nodes to continue running, do not exit the loop,
// Keep track of the last failed state in the loop since it'll be the one to return.
// TODO: If multiple nodes fail (which this mode allows), consolidate/summarize failure states in one.
stateOnComplete.ResetError()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means we keep the error in the CR because it is necessary for reporting in admin events. Then once it's reported we delete it right? My intuition says this should be behind the same configuration flag as deleting all the metadata for terminal nodes in failed states. Thoughts?

@Tom-Newton Tom-Newton force-pushed the tomnewton/collapse_sub_nodes_on_failures branch from 3877f0c to c15e9a2 Compare November 29, 2023 20:01
@Tom-Newton Tom-Newton force-pushed the tomnewton/remove_error_messages_from_crd branch from e362fb3 to 8e5e002 Compare November 29, 2023 20:04
@Tom-Newton Tom-Newton force-pushed the tomnewton/collapse_sub_nodes_on_failures branch from c15e9a2 to 40c378f Compare November 29, 2023 20:05
@Tom-Newton Tom-Newton force-pushed the tomnewton/collapse_sub_nodes_on_failures branch from 40c378f to a4755d9 Compare December 13, 2023 14:21
@Tom-Newton Tom-Newton changed the base branch from tomnewton/collapse_sub_nodes_on_failures to master December 13, 2023 14:21
@Tom-Newton Tom-Newton force-pushed the tomnewton/remove_error_messages_from_crd branch from 8e5e002 to 03e2e0c Compare December 13, 2023 14:32
@Tom-Newton
Copy link
Owner Author

Upstream PR flyteorg#4596

@Tom-Newton Tom-Newton closed this Dec 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants