[flyteadmin] Refactor panic recovery into middleware #5546
@@ -0,0 +1,38 @@
package middleware
I'm open to changing where this lives but it feels like there should be a middleware package, and any interceptors should move here imo.
Sounds good!
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #5546 +/- ##
==========================================
+ Coverage 35.89% 36.17% +0.27%
==========================================
Files 1301 1302 +1
Lines 109419 109388 -31
==========================================
+ Hits 39281 39570 +289
+ Misses 66041 65683 -358
- Partials 4097 4135 +38
c81893e to 6c27b60
Signed-off-by: Jason Parraga <[email protected]>
Looks good, thank you for refactoring and the detailed PR explanation!
cc @eapolinario who is looking into test failures
* Refactor panic handling to middleware
  Signed-off-by: Jason Parraga <[email protected]>
* Remove registration of old panicCounter
  Signed-off-by: Jason Parraga <[email protected]>
* Add test coverage
  Signed-off-by: Jason Parraga <[email protected]>
---------
Signed-off-by: Jason Parraga <[email protected]>
Signed-off-by: Bugra Gedik <[email protected]>
What changes were proposed in this pull request?
Previously, every gRPC handler recovered from panics inside its own body. This added repetitive boilerplate to each RPC handler and was fragile to maintain. This pull request introduces recovery middleware that recovers from panics for all RPCs mounted on the RPC server.
This pull request also proposes a change to the panic recovery logic.
I changed the recovery logic so that it logs the panic at the error level instead of the fatal level. The previous fatal log level would call `os.Exit(1)`, which immediately terminates the program ungracefully. My suspicion is that this made the existing Prometheus panic metrics effectively useless: Prometheus metrics are polled on an interval, and the server was likely already dead by the time the metrics would normally be polled. (Arguably the panic metrics could be removed now.)
IMO, it's better for high availability to have an RPC server that is alive and sending errors back (and reporting error metrics) than one that gets killed and is unresponsive until Kubernetes decides to boot up another healthy pod. As such, I have changed the behavior to return gRPC INTERNAL status codes instead of terminating the server. I'm open to debating this change, so feel free to share your opinion.
How was this patch tested?
Unit tests