Add a "stale workflow" scanner/fixer impl, to remove data beyond reasonable retention #5361

Groxx · 2023-07-25T22:13:08Z

For a variety of reasons, we have workflows sitting in some of our databases far beyond any reasonable time range.
E.g. a workflow completed months ago in a domain with 7 day retention -> unfortunately still retained.

We've been gradually finding and fixing problems that cause this, but the old data needs to be cleaned up
at some point. They can sometimes cause problems, particularly if they appear to be "running".

So this is a conservative scanner + fixer for workflows that meet these criteria.
It does not handle all states, nor does it support domains configured with >200 day retention (hardcoded but
trivially changeable, as a sanity check), but it's catching a large amount internally and seems safe to share
and run in more locations.

Since building and verifying this required figuring out the scanner/fixer code in detail,
and that proved to be a significant effort, this also includes three major additions on top
of the stale scanner:

there have been a variety of renames and public -> private changes to make naming consistent
- this helped discover a significant issue with the current execution scanner, so I'm keeping it
concrete execution fixer invariants can now be individually enabled and disabled via dynamic config
a new readme.md file that attempts to summarize the learnings, and how to config and verify scanner/fixer.

coveralls · 2023-07-25T22:52:40Z

Pull Request Test Coverage Report for Build 01899039-7b38-4151-afff-08a098661ffd

273 of 495 (55.15%) changed or added relevant lines in 8 files are covered.
45 unchanged lines in 13 files lost coverage.
Overall coverage increased (+0.07%) to 57.115%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
common/reconciliation/invariant/collection_enumer.go	0	1	0.0%
service/worker/scanner/executions/current_execution.go	0	1	0.0%
service/worker/scanner/executions/types.go	0	5	0.0%
service/worker/scanner/executions/concrete_execution.go	0	7	0.0%
tools/cli/adminDBCleanCommand.go	0	8	0.0%
tools/cli/adminDBScanCommand.go	0	8	0.0%
common/reconciliation/invariant/staleWorkflow.go	266	458	58.08%

Files with Coverage Reduction	New Missed Lines	%
common/reconciliation/invariant/collection_enumer.go	1	0%
service/worker/scanner/executions/concrete_execution.go	1	36.36%
common/persistence/sql/sqlplugin/postgres/db.go	2	85.0%
common/task/fifoTaskScheduler.go	2	84.54%
common/task/weightedRoundRobinTaskScheduler.go	2	89.12%
common/util.go	2	53.44%
service/history/handler.go	2	47.28%
service/matching/taskListManager.go	3	80.89%
common/types/history.go	4	45.35%
service/history/historyEngine.go	4	68.79%

Totals
Change from base Build 01898906-4fcf-4fd0-9203-586f0c269abc:	0.07%
Covered Lines:	87519
Relevant Lines:	153234

💛 - Coveralls

tools/cli/flags.go

common/reconciliation/invariant/staleWorkflow.go

service/worker/scanner/executions/types.go

service/worker/scanner/README.md

common/reconciliation/invariant/staleWorkflow.go

…onable retention

…r manager, it was using concrete?!

davidporter-id-au

LGTM.

Overall the testing you did certainly makes me a lot more confident in the overall approach, though looked extremely annoying.

If you'd done the work already I'd suggest not doing the backwards-compat, but no strong feelings that it's there, though I would be awary introducing further complexity or working on the assumption or just set to terminate-if-running

CHANGELOG.md

service/worker/scanner/executions/concrete_execution.go

service/worker/scanner/executions/current_execution.go

service/worker/scanner/shardscanner/activities.go

service/worker/scanner/shardscanner/fixer_workflow.go

service/worker/scanner/README.md

…licity

This was accidentally broken in cadence-workflow#5361 - this timer-fixer uses the shared fixer workflow, which now expects non-empty config about what invariants are enabled. Since that'll help make this clearer / less "enable by default"-prone for future timer-invariants, I've just added a static non-empty config.

This was accidentally broken in #5361 - this timer-fixer uses the shared fixer workflow, which now expects non-empty config about what invariants are enabled. Since that'll help make this clearer / less "enable by default"-prone for future timer-invariants, I've just added a static non-empty config.

#5361 unfortunately broke timer-fixers, as the shared workflow requires a non-empty "enabled" list and none was provided in that PR. #5433 restores that functionality. --- Arguably `v1.2.5` could be retracted, as it introduces a potentially-problematic bug. Since this fixer is disabled by default, we're going to skip doing that: we would have to publish `v1.2.6` to put the retraction into effect, and anyone updating that rapidly will not have much down-time. Plus the fixer is not required for correct behavior.

Groxx marked this pull request as ready for review July 25, 2023 22:15

Groxx commented Jul 25, 2023

View reviewed changes

tools/cli/flags.go Show resolved Hide resolved

davidporter-id-au reviewed Jul 25, 2023

View reviewed changes

common/reconciliation/invariant/staleWorkflow.go Show resolved Hide resolved

service/worker/scanner/executions/types.go Show resolved Hide resolved

Groxx force-pushed the stale-workflow-fixer branch from d474273 to 73db0d9 Compare August 29, 2023 19:05

Groxx force-pushed the stale-workflow-fixer branch 6 times, most recently from 7625c8e to c15bc03 Compare September 6, 2023 02:01