Clear past errors from workflow state #4624
base: master
Signed-off-by: Thomas Newton <[email protected]>
This reverts commit dab428d. Signed-off-by: Thomas Newton <[email protected]>
…rrors Signed-off-by: Thomas Newton <[email protected]>
Codecov Report
Coverage diff, master vs. #4624:
              master    #4624      +/-
Coverage      58.20%   58.07%   -0.14%
Files            626      476     -150
Lines          53800    38119   -15681
Hits           31316    22138    -9178
Misses         19976    14056    -5920
Partials        2508     1925     -583
It looks like I'm missing a little bit of code coverage. I'll try to find some time to fix that.
If I understand this logic, the goal is to delete all the errors but the last one, right? I'm trying to wrap my head around the logic of iterating over downstream nodes but need to dive deeper. Is there determinism in the ordering? Or is there a scenario here where we delete all of the error messages? For example, say we have two nodes (n0 and n1). If the first time we iterate over these the order is n0, n1, then we clear the error from n0. If the second time we iterate the order is n1, n0, then we clear the error from n1 and have just cleared all of our errors.
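To make the scenario above concrete, here is a minimal sketch in Go. It is not the propeller code: the node type and this handleDownstream function are hypothetical stand-ins that model only the "clear the previously tracked failure, keep the latest one" step, assuming both nodes have already failed and that the iteration order can differ between passes.

```go
package main

import "fmt"

// node is a hypothetical, stripped-down stand-in for a failed node's status.
type node struct {
	id  string
	err string // empty once the error has been cleared
}

// handleDownstream models one pass over already-failed downstream nodes.
// Whenever another failure is seen, the error of the previously tracked
// failure is cleared, so only the last node in iteration order keeps its error.
func handleDownstream(order []*node) *node {
	var lastFailed *node
	for _, n := range order {
		if lastFailed != nil {
			lastFailed.err = "" // models the Clear(...) call under review
		}
		lastFailed = n
	}
	return lastFailed
}

func main() {
	n0 := &node{id: "n0", err: "n0 failed"}
	n1 := &node{id: "n1", err: "n1 failed"}

	handleDownstream([]*node{n0, n1}) // first pass: n0's error is cleared
	handleDownstream([]*node{n1, n0}) // second pass, reversed: n1's error is cleared

	fmt.Printf("n0=%q n1=%q\n", n0.err, n1.err)
}
```

In this model both error strings end up empty after the second pass. If the two passes used the same order, n1 would keep its error, which is why the determinism of the ordering matters here.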
@@ -298,6 +299,10 @@ func (c *recursiveNodeExecutor) handleDownstream(ctx context.Context, execContex
// If the failure policy allows other nodes to continue running, do not exit the loop,
// Keep track of the last failed state in the loop since it'll be the one to return.
// TODO: If multiple nodes fail (which this mode allows), consolidate/summarize failure states in one.
if executableNodeStatusOnComplete != nil {
	c.nodeExecutor.Clear(executableNodeStatusOnComplete)
I think it makes sense to add the enableCRDebugMetadata bool argument here to the ClearExecutionError function on the MutableNodeStatus interface. Then this call more closely reflects the UpdatePhase call above and we can remove adding a Clear function to the nodeExecutor struct and similarly the NodeExecutor interface. Thoughts?
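A rough sketch of what that suggestion could look like, in Go. Everything here is a simplified assumption for illustration: ExecutionError and nodeStatus are hypothetical local types, the interface shows only the one method under discussion, and the rule that the flag being true keeps the error is inferred from the PR description ("set this to true to restore the previous behaviour").

```go
package main

import "fmt"

// ExecutionError is a hypothetical stand-in for the error stored on a node status.
type ExecutionError struct{ Message string }

// MutableNodeStatus sketches only the method discussed in the review comment;
// the real interface has many more methods.
type MutableNodeStatus interface {
	ClearExecutionError(enableCRDebugMetadata bool)
}

// nodeStatus is a hypothetical minimal implementation.
type nodeStatus struct{ Error *ExecutionError }

// ClearExecutionError drops the stored error unless debug metadata is enabled
// (assumed semantics: true preserves the previous keep-everything behaviour).
func (s *nodeStatus) ClearExecutionError(enableCRDebugMetadata bool) {
	if enableCRDebugMetadata {
		return
	}
	s.Error = nil
}

func main() {
	var status MutableNodeStatus = &nodeStatus{Error: &ExecutionError{Message: "boom"}}
	status.ClearExecutionError(false)
	fmt.Println(status.(*nodeStatus).Error == nil)
}
```

With the flag passed at the call site, the caller would invoke the method directly on the node status, so no separate Clear function would be needed on the executor in this sketch.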
I've never used golang before I started using flyte, so I have little opinion on how the interfaces are organised. I will try to implement it as you suggested.
That's a good question. I guess I assumed that
Unfortunately I've been very distracted from this recently but I do plan to come back to it.
Tracking issue
#4569
Why are the changes needed?
Reduce unneeded information stored in etcd when using failure_policy=WorkflowFailurePolicy.FAIL_AFTER_EXECUTABLE_NODES_COMPLETE. This allows flyte to scale to larger workflows before hitting etcd size limits.
What changes were proposed in this pull request?
… node-config.enable-cr-debug-metadata config option. Set this to true to restore the previous behaviour.
… FAIL_AFTER_EXECUTABLE_NODES_COMPLETE. Without this the workflow will fail as soon as there is one failure, so there can never be more than one error regardless of this PR.
… enable-cr-debug-metadata config option.
… TestWorkflowExecutor_HandleFlyteWorkflow_Failing to have sub test cases for combinations of FAIL_AFTER_EXECUTABLE_NODES_COMPLETE and enable-cr-debug-metadata
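The combinations covered by those sub test cases can be summarised with a small Go model. This is only a sketch of the intended outcome as described above, not code from the PR: storedErrors and its parameters are hypothetical names, and the function counts how many node errors would remain in the workflow state.

```go
package main

import "fmt"

// storedErrors is a hypothetical model of how many node errors remain in the
// workflow state, given how many nodes would fail if allowed to run.
func storedErrors(failedNodes int, failAfterExecutableNodesComplete, enableCRDebugMetadata bool) int {
	if failedNodes == 0 {
		return 0
	}
	if !failAfterExecutableNodesComplete {
		// The workflow fails on the first failure, so at most one error exists.
		return 1
	}
	if enableCRDebugMetadata {
		// Previous behaviour: every failed node keeps its error.
		return failedNodes
	}
	// Intended new behaviour: past errors are cleared, only the last is kept.
	return 1
}

func main() {
	for _, failAfter := range []bool{false, true} {
		for _, debug := range []bool{false, true} {
			fmt.Printf("failAfterComplete=%v debugMetadata=%v errors=%d\n",
				failAfter, debug, storedErrors(3, failAfter, debug))
		}
	}
}
```

Only the combination of FAIL_AFTER_EXECUTABLE_NODES_COMPLETE with debug metadata enabled grows with the number of failed nodes in this model, which is the case the etcd size concern is about.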
How was this patch tested?
Updated unit tests.
I have been running something very similar to this in our prod deployment for some time.
Setup process
Screenshots
Check all the applicable boxes
Related PRs
Follow up to #4596
Docs link