We now wait 10 seconds before we start returning shard closed errors, also stop retrying on shard closed errors #5938

Merged on Apr 29, 2024 (13 commits)

@jakobht jakobht commented Apr 25, 2024

What changed?

  • Introduced a shardRecentlyClosed error that the shard returns to signal that it was recently closed. This error does not cause the task handler to emit error logs or metrics.
  • Added a test that hits this guard in every place it exists.
  • Deleted unreachable instances of the guard. These were introduced in #4547 ("Update shard context to reduce DB calls for closed shards"), when the checks sat inside a loop that accessed the DB and could therefore change the shard's closed state on each iteration.
  • Removed redundant tests that covered the same cases as the new guard test.
  • Stopped retrying on shard closed errors.
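
As a rough illustration of the last bullet, the sketch below shows one way a retry loop could give up as soon as it sees a shard closed error. It is not the code from this PR: `ErrShardClosed`, `ShardRecentlyClosedError`, `errTransient`, `isShardClosed`, and `runWithRetry` are hypothetical names chosen for the example, and the real definitions in `service/history/shard` may look different.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the error states described in this PR.
var (
	ErrShardClosed = errors.New("shard closed")
	errTransient   = errors.New("transient failure") // any retryable error
)

// ShardRecentlyClosedError is a hypothetical error type for a shard that
// closed only a short while ago.
type ShardRecentlyClosedError struct{}

func (e *ShardRecentlyClosedError) Error() string { return "shard recently closed" }

// isShardClosed reports whether err, possibly wrapped, is one of the two
// shard closed errors.
func isShardClosed(err error) bool {
	var recent *ShardRecentlyClosedError
	return errors.Is(err, ErrShardClosed) || errors.As(err, &recent)
}

// runWithRetry calls op up to maxAttempts times. It stops early on success
// or on a shard closed error, since retrying against a closed shard cannot
// succeed. It returns the number of attempts made and the last error.
func runWithRetry(maxAttempts int, op func() error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil || isShardClosed(err) {
			return attempt, err
		}
	}
	return maxAttempts, err
}

func main() {
	// A wrapped shard closed error ends the loop after the first attempt.
	attempts, err := runWithRetry(3, func() error {
		return fmt.Errorf("process task: %w", &ShardRecentlyClosedError{})
	})
	fmt.Println(attempts, err)

	// Any other error uses up all attempts.
	attempts, err = runWithRetry(3, func() error { return errTransient })
	fmt.Println(attempts, err)
}
```

Matching with `errors.Is` and `errors.As` rather than `==` keeps the check working when callers wrap the error with extra context.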

Why?
Shard closing is not an unexpected state, so we should not emit error logs and metrics for this.

If a closed shard keeps receiving requests, however, we should start emitting error logs and metrics. We therefore wait 10 seconds, and if requests are still arriving after that, we start emitting them.
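
The grace period can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the implementation in this PR: `gracePeriod`, `errShardRecentlyClosed`, `errShardClosed`, `closedShardError`, and `shouldReport` are hypothetical names, and treating exactly 10 seconds as "no longer recent" is an assumption the description does not settle.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// gracePeriod mirrors the 10 seconds mentioned in the PR description.
const gracePeriod = 10 * time.Second

// Hypothetical error values; the real code may define these differently.
var (
	errShardRecentlyClosed = errors.New("shard recently closed")
	errShardClosed         = errors.New("shard closed")
)

// closedShardError picks the error for a request that reaches a closed
// shard sinceClose after the shard was closed.
func closedShardError(sinceClose time.Duration) error {
	if sinceClose < gracePeriod {
		// Expected during restarts and shard movement.
		return errShardRecentlyClosed
	}
	// Requests are still arriving after the grace period.
	return errShardClosed
}

// shouldReport tells a task handler whether to emit error logs and metrics
// for err. The recently closed error is deliberately kept quiet.
func shouldReport(err error) bool {
	return err != nil && !errors.Is(err, errShardRecentlyClosed)
}

func main() {
	for _, d := range []time.Duration{2 * time.Second, 30 * time.Second} {
		err := closedShardError(d)
		fmt.Printf("%v after close: %v (report=%t)\n", d, err, shouldReport(err))
	}
}
```

Keeping the decision in the error value means the task handler only has to inspect the error it receives and does not need to know when the shard closed.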

The guard tests and deletions are needed to satisfy the new line coverage check.

How did you test it?
Tested with unit tests and by deploying to staging. The deployment shows we can now do restarts without seeing these errors.

Potential risks
This changes some relatively core task processing logic. However, the main change is in how the error states are communicated; the main flow is not touched.

Release notes

Documentation Changes


codecov bot commented Apr 25, 2024

Codecov Report

Attention: Patch coverage is 67.34694%, with 16 lines in your changes missing coverage. Please review.

Project coverage is 62.77%. Comparing base (969a6c6) to head (f313fa5).

❗ Current head f313fa5 differs from pull request most recent head ccf6dce. Consider uploading reports for the commit ccf6dce to get more accurate results

Additional details and impacted files
| Files | Coverage Δ |
|---|---|
| service/history/task/task.go | 79.12% <100.00%> (+0.41%) ⬆️ |
| ...ervice/history/queue/timer_queue_processor_base.go | 70.00% <50.00%> (+0.07%) ⬆️ |
| ...ice/history/queue/transfer_queue_processor_base.go | 63.34% <50.00%> (+0.10%) ⬆️ |
| ...istory/queue/cross_cluster_queue_processor_base.go | 41.98% <0.00%> (-0.10%) ⬇️ |
| service/history/queue/timer_queue_processor.go | 0.00% <0.00%> (ø) |
| service/history/queue/transfer_queue_processor.go | 36.09% <0.00%> (-0.10%) ⬇️ |
| service/history/shard/context.go | 33.82% <76.47%> (+3.51%) ⬆️ |

... and 6 files with indirect coverage changes



Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


coveralls commented Apr 25, 2024

Pull Request Test Coverage Report for Build 018f28ec-ab6c-48ff-b477-b9b0bbf1424d

Details

  • 45 of 53 (84.91%) changed or added relevant lines in 7 files are covered.
  • 43 unchanged lines in 10 files lost coverage.
  • Overall coverage increased (+0.04%) to 67.829%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| service/history/queue/timer_queue_processor.go | 0 | 2 | 0.0% |
| service/history/queue/transfer_queue_processor.go | 0 | 2 | 0.0% |
| service/history/shard/context.go | 33 | 37 | 89.19% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| tools/cli/admin_db_decode_thrift.go | 1 | 71.79% |
| common/task/weighted_round_robin_task_scheduler.go | 2 | 88.56% |
| common/task/fifo_task_scheduler.go | 2 | 87.63% |
| service/frontend/api/handler.go | 2 | 62.07% |
| service/history/execution/mutable_state_util.go | 2 | 78.52% |
| service/history/task/fetcher.go | 3 | 86.6% |
| service/history/shard/context.go | 4 | 69.08% |
| common/archiver/filestore/historyArchiver.go | 4 | 80.95% |
| service/frontend/wrappers/metered/metered.go | 9 | 63.18% |
| service/history/execution/mutable_state_task_refresher.go | 14 | 67.09% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 018f1f09-35ae-4fac-9e3c-30bf2c31a81f | 0.04% |
| Covered Lines | 99576 |
| Relevant Lines | 146805 |


@jakobht changed the title from "We now wait 10 seconds before we start returning shard closed errors" to "We now wait 10 seconds before we start returning shard closed errors, also stop retrying on shard closed errors" on Apr 29, 2024
@jakobht jakobht merged commit 6660bec into cadence-workflow:master Apr 29, 2024
18 checks passed