Remove unneeded errors from CRD #3
Conversation
Force-pushed a364764 to 15f46b0
Force-pushed 1e3b1b4 to e362fb3
Looks great! Seems like you went for a bit of a dive here!
flyteplugins/go/tasks/pluginmachinery/flytek8s/config/config.go (comment resolved, outdated)
	}
	if in.StartedAt == nil {
		in.StartedAt = &n
	}
	if in.LastAttemptStartedAt == nil {
		in.LastAttemptStartedAt = &n
	}
}
in.LastUpdatedAt = &n
// For cases in which the node is either Succeeded or Skipped we clear most fields from the status
// except for StoppedAt and Phase. StoppedAt is used to calculate transition latency between this node and
// any downstream nodes, and Phase is required for propeller to continue to downstream nodes.
if p == NodePhaseSucceeded || p == NodePhaseSkipped {
	// Clear most status-related fields after reaching a terminal state. This keeps the CRD state small to avoid
	// etcd size limits. Importantly, we keep Phase, StoppedAt and Error, which are still needed further on.
	// The error is kept for now but will be cleaned up when possible, because errors can be very large.
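For readers without the full diff in front of them, here is a minimal, self-contained sketch of the idea being discussed. The struct and field names below are hypothetical stand-ins, not Flyte's actual NodeStatus definition; the point is only that everything except the phase, stop time, and error is reset so the serialized CR stays small.

package main

import (
	"fmt"
	"time"
)

// SimplifiedNodeStatus is a hypothetical, trimmed-down stand-in for a node
// status as stored in the CR.
type SimplifiedNodeStatus struct {
	Phase     string
	StartedAt *time.Time
	StoppedAt *time.Time
	Message   string
	DataDir   string
	Error     *string
}

// clearOnTermination drops everything except Phase, StoppedAt and Error,
// mirroring the intent of the change: keep only what downstream handling needs.
func (s *SimplifiedNodeStatus) clearOnTermination() {
	s.StartedAt = nil
	s.Message = ""
	s.DataDir = ""
	// Phase, StoppedAt and Error are intentionally preserved.
}

func main() {
	now := time.Now()
	errMsg := "user code raised ValueError"
	s := SimplifiedNodeStatus{
		Phase:     "Succeeded",
		StartedAt: &now,
		StoppedAt: &now,
		Message:   "done",
		DataDir:   "s3://bucket/data",
		Error:     &errMsg,
	}
	s.clearOnTermination()
	fmt.Printf("%+v\n", s)
}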
Is there a way we can gate this behind a flag? I know some users like having error messages and statuses persist in the CR when nodes fail.
There must be a way but I don't really know how. Do you have any suggestions?
I think one way would be to add a ClearStateOnTermination attribute to NodeStatus. Then, if we set it correctly when creating the node, everything should work after that. The disadvantage of that, though, is that we create a new parameter that needs to be stored in etcd. Personally I think it would be preferable to pass in an argument to UpdatePhase. It seems like this would require a change to a commonly used interface though, and I don't really know what the implications of that would be.
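To make the two alternatives concrete, here is a rough sketch of what each could look like. All names are hypothetical simplifications, not the actual Flyte propeller types or signatures; it only illustrates the trade-off between persisting a new field and changing a widely used method.

package sketch

// Option 1: a flag stored on the status itself. The cost is that this extra
// field must be persisted in etcd alongside everything else.
type NodeStatusSketch struct {
	Phase                   string
	Message                 string
	ClearStateOnTermination bool
}

// Option 2: thread the decision through the call instead, so nothing extra is
// stored in the CR. The trade-off is a signature change to a commonly used
// interface method (UpdatePhase in the real code base).
func (s *NodeStatusSketch) UpdatePhase(phase string, clearState bool) {
	s.Phase = phase
	if clearState {
		// Reset the large fields that are no longer needed after termination.
		s.Message = ""
	}
}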
I opened a draft upstream PR, so it's probably best to discuss there: flyteorg#4596
startedAt := nodeStatus.GetStartedAt()
if startedAt == nil {
	startedAt = &t
}
nodeStatus.UpdatePhase(v1alpha1.NodePhaseFailed, t, nodeStatus.GetMessage(), nodeStatus.GetExecutionError())
This is just because we already delete the startedAt timestamp and we need it to observe the FailureDuration metric below, right? IIUC this will make the metric useless because we set startedAt to Now(); is this correct?
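To make the reviewer's concern concrete, here is a small self-contained sketch. The observeFailureDuration helper is hypothetical and stands in for whatever stopwatch metric the real code uses; the point is that falling back to the current time collapses the observed duration to roughly zero.

package main

import (
	"fmt"
	"time"
)

// observeFailureDuration is a hypothetical stand-in for the real metric call;
// it just reports the elapsed time between start and end.
func observeFailureDuration(start, end time.Time) {
	fmt.Printf("FailureDuration observed: %v\n", end.Sub(start))
}

func main() {
	t := time.Now()

	// If the stored startedAt was already cleared, falling back to "now" means
	// the observed failure duration is effectively zero, which is the concern.
	var startedAt *time.Time // cleared earlier, so nil
	if startedAt == nil {
		startedAt = &t
	}
	observeFailureDuration(*startedAt, t) // prints ~0s
}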
@@ -297,6 +297,7 @@ func (c *recursiveNodeExecutor) handleDownstream(ctx context.Context, execContex
// If the failure policy allows other nodes to continue running, do not exit the loop,
// Keep track of the last failed state in the loop since it'll be the one to return.
// TODO: If multiple nodes fail (which this mode allows), consolidate/summarize failure states in one.
stateOnComplete.ResetError()
This means we keep the error in the CR because it is necessary for reporting in admin events, and then once it's reported we delete it, right? My intuition says this should be behind the same configuration flag as deleting all the metadata for terminal nodes in failed states. Thoughts?
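A minimal sketch of the configuration gate the reviewer is suggesting is below. The flag name and config wiring are hypothetical, not what the upstream PR actually adds; it only shows how the ResetError call could be made opt-in.

package sketch

// Config is a hypothetical configuration block; the real change would likely
// hang off propeller's existing node configuration rather than a new struct.
type Config struct {
	// ClearTerminalNodeState would gate both behaviours discussed here:
	// clearing status metadata for terminal nodes and resetting errors once
	// they have been reported to admin.
	ClearTerminalNodeState bool
}

type NodeState struct {
	Err *string
}

func (s *NodeState) ResetError() { s.Err = nil }

// maybeResetError drops the already-reported error only when the operator has
// opted in, preserving today's behaviour by default.
func maybeResetError(cfg Config, stateOnComplete *NodeState) {
	if cfg.ClearTerminalNodeState {
		stateOnComplete.ResetError()
	}
}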
Force-pushed 3877f0c to c15e9a2
Force-pushed e362fb3 to 8e5e002
Force-pushed c15e9a2 to 40c378f
Force-pushed 40c378f to a4755d9
This reverts commit dab428d.
Force-pushed 8e5e002 to 03e2e0c
Upstream PR flyteorg#4596
Tracking issue
Describe your changes
Check all the applicable boxes
Screenshots
Note to reviewers