[BUG] Handle auto refresh cache race condition #5406
Signed-off-by: Paul Dittamo <[email protected]>
Codecov Report
Attention: Patch coverage is
Coverage diff, master vs. #5406:
| Metric   | master | #5406  | +/- |
| Coverage | 61.10% | 61.10% |     |
| Files    | 793    | 793    |     |
| Lines    | 51156  | 51164  | +8  |
| Hits     | 31257  | 31264  | +7  |
| Misses   | 17027  | 17028  | +1  |
| Partials | 2872   | 2872   |     |
LGTM, just one question
Is there any plan to create a fix release with this fix?
Thank you for working on this. Once the next release is available I can test this, and I'll report back on whether the issue is solved.
@andresgomezfrr Yes, we are validating a new release at the end of this week and, barring any issues, will get an official release out next week.
@andresgomezfrr @pablocasares there's an RC that contains this fix. I'm unsure when a final release containing this change will be made. I'll ping when that happens.
* utilize auto refresh processing set with entry expiration
* add unit test
* update processing grace period to 5 sync periods

Signed-off-by: Paul Dittamo <[email protected]>
Tracking issue
Potentially closes: #5335
Why are the changes needed?
Propeller v1.12.0 introduced a bug in which child/external workflow status was not propagated back up to the parent workflow.
We have not been able to reproduce this exactly. The current theory is that there is a race condition in which an item can be in the processing set (introduced in the new Flyte release) while not being in the workqueue. Due to the condition
if item, ok := value.(Item); !ok || (ok && !item.IsTerminal() && !w.processing.Contains(k)) {
(in enqueueBatches), such an item would never be re-added to the workqueue, and so would never be re-synced.
Why we think this happens:
The item (workflow) is still in the LruCache, as we keep getting status for it in GetStatus.
If the item were not in the cache, it would get re-added to the workqueue. If an item were in the workqueue, it would be included in the syncItem process that is triggered in auto_refresh's sync. Sync grabs batches off the workqueue.
enqueueBatches adds items to the workqueue. An item only gets added to the workqueue if, among other conditions, it is not in processing.
gorm logs indicate that admin is not receiving GetExecution requests for the child workflow whose status is not updating.
The addition of the processing sync.set was the only change that stood out between flyte 1.11 and 1.12.
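To make the suspected failure mode concrete, here is a minimal sketch of the enqueue decision in isolation. It is not the actual auto_refresh code: the `item` type and the `shouldEnqueue` helper are hypothetical stand-ins, and the processing lookup is reduced to a boolean argument. It only mirrors the shape of the condition quoted above.

```go
package main

// item is a hypothetical stand-in for a cached entry; only its
// terminal flag matters for the enqueue decision sketched here.
type item struct {
	terminal bool
}

// shouldEnqueue mirrors the shape of the condition quoted from
// enqueueBatches: enqueue when the cached value is not an item at all,
// or when it is a non-terminal item that is not marked as processing.
// inProcessing stands in for w.processing.Contains(k).
func shouldEnqueue(value interface{}, inProcessing bool) bool {
	it, ok := value.(item)
	return !ok || (!it.terminal && !inProcessing)
}
```

Under this sketch, a non-terminal item that is marked as processing is skipped on every pass. If that item has also fallen out of the workqueue, nothing clears the processing mark, so the item is never synced again, which matches the symptom described above.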
What changes were proposed in this pull request?
We want to keep the processing optimization to reduce the overhead of adding duplicate items to the workqueue.
We swap out the processing set in favor of a map in which the keys are the same as in the set and the values are timestamps of when each item was added to processing. We then check how long the item has been in processing: if an item has been in processing for 10 sync periods, we "evict" it from processing so that the item gets re-added to the workqueue.
How was this patch tested?
Setup process
Screenshots
Check all the applicable boxes
Related PRs
Docs link