-
Notifications
You must be signed in to change notification settings - Fork 805
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
We now wait 10 seconds before we start returning shard closed errors, also stop retrying on shard closed errors #5938
We now wait 10 seconds before we start returning shard closed errors, also stop retrying on shard closed errors #5938
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files
... and 6 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
Shard closing is not an unexpected state, so we should not emit error logs and metrics on this. However if we stay in a state where a closed shard keeps getting requests. Then we should start emitting error logs and metrics.
This reverts commit f2687b6.
… information on when the task timed out as well
0b95a40
to
f313fa5
Compare
What changed?
Deleted unreachable instances of the guard. These were introduced in Update shard context to reduce DB calls for closed shards #4547 when the checks were in a loop that was accessing the DB, which was able to change the shard's closed state in each iterationWhy?
Shard closing is not an unexpected state, so we should not emit error logs and metrics for this.
If we stay in a state where a closed shard keeps getting requests, then we should start emitting error logs and metrics, so we wait 10 seconds, and if we still see requests then we start emitting the metrics.
All the guard testing and deleting is necessary to make the new line coverage check happy.
How did you test it?
Tested with unit tests and by deploying to staging. The deployment shows we can now do restarts without seeing these errors.
Potential risks
This does change some relatively core task processing logic, however the main change is on how the error states are communicated. The main flow is not touched.
Release notes
Documentation Changes