Fix: Make distributed error aggregation opt-in #6103
Merged
Why are the changes needed?
For RFC #5598, flytepropeller was given the ability to list error files in the so-called raw output prefix bucket of an execution with the goal of identifying which worker pod in a failed distributed task experienced the first error.
In GCP, listing the error files requires the "storage.objects.list" permission, which so far wasn't given to propeller. I added this permission to the Flyte propeller custom role here. That being said, because this feature is therefore not backwards compatible, I propose to make it opt-in.
If you agree with this, I'll make another PR to document this feature and how to activate it here and/or here.
What changes were proposed in this pull request?
Only search for multiple error files from the different workers of a distributed task, as proposed in RFC #5598, if this is actively enabled in the flytepropeller config, so that adding the "storage.objects.list" permission is not strictly required.
How was this patch tested?
Ran flytepropeller locally against a GKE-based deployment, both with and without the flag enabled, and adapted the unit tests.
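For readers who want the gist of the opt-in behavior described under "What changes were proposed", here is a minimal Go sketch. It is not the code from this PR: the `Config` field name `EnableDistributedErrorAggregation`, the `ErrorStore` interface, the `WorkerError` struct, and the `error.pb` / `error-*.pb` file names are all assumptions made for illustration. The point it shows is that the list call (the one needing "storage.objects.list" on GCP) only happens when the flag is on; otherwise a single known error file is read directly.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// Config is a hypothetical stand-in for the relevant flytepropeller setting.
type Config struct {
	// EnableDistributedErrorAggregation is an assumed name for the opt-in flag.
	EnableDistributedErrorAggregation bool
}

// WorkerError is a simplified, hypothetical error record.
type WorkerError struct {
	Worker    string
	Timestamp int64 // time of the error, e.g. unix nanoseconds
	Message   string
}

// ErrorStore abstracts the raw output prefix bucket (hypothetical interface).
type ErrorStore interface {
	List(prefix string) ([]string, error) // on GCP this needs "storage.objects.list"
	Read(key string) (WorkerError, error)
}

// mapStore is an in-memory ErrorStore used only for this example.
type mapStore map[string]WorkerError

func (m mapStore) List(prefix string) ([]string, error) {
	keys := []string{}
	for k := range m {
		if strings.HasPrefix(k, prefix) {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys, nil
}

func (m mapStore) Read(key string) (WorkerError, error) {
	e, ok := m[key]
	if !ok {
		return WorkerError{}, errors.New("not found: " + key)
	}
	return e, nil
}

// ReadTaskError returns the error to report for a failed task.
// With the flag off it reads one well-known file and never lists the bucket.
// With the flag on it lists all error files and returns the earliest one.
func ReadTaskError(store ErrorStore, cfg Config, prefix string) (WorkerError, error) {
	if !cfg.EnableDistributedErrorAggregation {
		return store.Read(prefix + "/error.pb") // assumed single-file name
	}
	keys, err := store.List(prefix)
	if err != nil {
		return WorkerError{}, err
	}
	var earliest WorkerError
	found := false
	for _, k := range keys {
		e, err := store.Read(k)
		if err != nil {
			return WorkerError{}, err
		}
		if !found || e.Timestamp < earliest.Timestamp {
			earliest, found = e, true
		}
	}
	if !found {
		return WorkerError{}, errors.New("no error files under " + prefix)
	}
	return earliest, nil
}

func main() {
	store := mapStore{
		"out/error.pb":   {Worker: "single", Timestamp: 30, Message: "task failed"},
		"out/error-a.pb": {Worker: "worker-0", Timestamp: 20, Message: "peer disconnected"},
		"out/error-b.pb": {Worker: "worker-1", Timestamp: 10, Message: "out of memory"},
	}
	for _, enabled := range []bool{false, true} {
		e, err := ReadTaskError(store, Config{EnableDistributedErrorAggregation: enabled}, "out")
		fmt.Println(enabled, e.Worker, err)
	}
}
```

Keeping the default path free of any list call is what preserves backwards compatibility for deployments whose propeller role lacks the extra permission.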