-
Notifications
You must be signed in to change notification settings - Fork 680
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tracking for active node and task execution counts in propeller #4986
Conversation
Thank you for opening this pull request! 🙌 These tips will help get your PR across the finish line:
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #4986 +/- ##
==========================================
- Coverage 59.10% 59.06% -0.04%
==========================================
Files 645 646 +1
Lines 55581 55714 +133
==========================================
+ Hits 32851 32910 +59
- Misses 20136 20208 +72
- Partials 2594 2596 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
8a948ce
to
1254afa
Compare
Signed-off-by: Shardool <[email protected]>
Signed-off-by: Shardool <[email protected]>
1254afa
to
52c023b
Compare
Signed-off-by: Shardool <[email protected]>
Signed-off-by: Paul Dittamo <[email protected]>
@sshardool this PR is great. Have you noticed any increases in latency for workflow executions with this running? How many workers/threads are running for your propeller deployments? I figure this should only add in the order of nanoseconds for propeller loops with lock contention with ExecutionStatsHolder updates so it shouldn't be a concern. We might want to run this on an internal cluster for a little bit prior to merging, but this PR LGTM. |
Thanks for the review @pvditt . We currently have this running with 4 worker threads and have not seen any increase in execution latencies. You're right, critical section for all operations which are inline with the work-queue tasks is very small (single map entry update). Let me know if an additional metric tracking the wait duration (i.e time taken for mutex acquisition) in |
Signed-off-by: Paul Dittamo <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for adding this
Congrats on merging your first pull request! 🎉 |
…lyteorg#4986) * Add tracking for active node and task execution counts in propeller Signed-off-by: Shardool <[email protected]> * Update unit tests for task and node execution counts Signed-off-by: Shardool <[email protected]> * Fix linter errors Signed-off-by: Shardool <[email protected]> * fix linter errors Signed-off-by: Paul Dittamo <[email protected]> --------- Signed-off-by: Shardool <[email protected]> Signed-off-by: Paul Dittamo <[email protected]> Co-authored-by: Paul Dittamo <[email protected]>
Tracking issue
#4593
Why are the changes needed?
Fixes this from the issue description
What changes were proposed in this pull request?
The metrics for active node and task executions are added to flytepropeller with the expectation that the workflow re-evaluation duration is an acceptable delay for updating the metrics (eventually consistent)
How was this patch tested?
Tested on a dev flyte setup (single node).
Tested is in progress in larger scale internal deployments.
Sample of the new metrics: