[BUG] Retry fetching subworkflow output data on failure #4602

Merged (18 commits) on Jan 16, 2024

Contributor

@pvditt pvditt commented Dec 15, 2023

Tracking issue

Closes: #4369

Why are the changes needed?

  • Errors are not bubbled up when fetching a subworkflow's output fails. As a result, the failed fetch is only handled later, when nil output literals are read, which leads to unclear or confusing error messages.

- When remote store reads fail in GetOutputs or GetInputs, errors are not bubbled up; the URL blob is returned to the client instead. However, the URL blob is deprecated, so sending a clear error status is preferable. Update: moving this to another PR

  • GetLimitMegabytes can differ between propeller and admin under certain deployment configurations. In that case, the GetExecutionData call used to fetch subworkflow output data could always fail on admin, while the same read would succeed if done on propeller. Roughly similar logic is used in recoverInputs.

What changes were proposed in this pull request?

  • Retry fetching subworkflow output data on propeller when GetExecutionData admin call fails.
    - Bubble up errors when fetching Input and Output for workflow, node and task executions instead of returning empty values. Update: moving this to another PR
    - Remove UrlBlob usage when fetching Input and Output for workflow, node and task executions. Update: moving this to another PR

How was this patch tested?

  • Added/updated unit tests
  • Ran subworkflows.py parent_workflow in flytesnacks to exercise GetExecutionData, and manually set GetExecutionData to fail to test whether the remote store retry in propeller works.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. bug Something isn't working labels Dec 15, 2023
@pvditt pvditt requested a review from hamersaw December 15, 2023 01:03
@Future-Outlier
Member

Amazing PR, I have been bothered by this problem.
I will test this.

codecov bot commented Dec 22, 2023

Codecov Report

Attention: 9 lines in your changes are missing coverage. Please review.

Comparison is base (ac42562) 58.13% compared to head (1f161f9) 58.21%.
Report is 7 commits behind head on master.

Files Patch % Lines
flytepropeller/pkg/controller/controller.go 0.00% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4602      +/-   ##
==========================================
+ Coverage   58.13%   58.21%   +0.07%     
==========================================
  Files         626      626              
  Lines       53786    53796      +10     
==========================================
+ Hits        31271    31316      +45     
+ Misses      20007    19972      -35     
  Partials     2508     2508              
Flag Coverage Δ
unittests 58.21% <72.72%> (+0.07%) ⬆️


@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Dec 27, 2023
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jan 12, 2024
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Jan 13, 2024
@pvditt
Contributor Author

pvditt commented Jan 13, 2024

@hamersaw I removed the FlyteAdmin changes from this PR. I will open a housekeeping PR with those changes.

@pvditt pvditt changed the title [BUG] Bubble up errors on remote store reads [BUG] Retry fetching subworkflow output data on failure Jan 13, 2024
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jan 16, 2024
@hamersaw hamersaw merged commit 136ac8d into master Jan 16, 2024
45 checks passed
@hamersaw hamersaw deleted the bug/return-error-on-failed-storage-reads branch January 16, 2024 14:52