Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a "stale workflow" scanner/fixer impl, to remove data beyond reasonable retention #5361

Merged
merged 13 commits into from
Oct 3, 2023

Conversation

Groxx
Copy link
Member

@Groxx Groxx commented Jul 25, 2023

For a variety of reasons, we have workflows sitting in some of our databases far beyond any reasonable time range.
E.g. a workflow completed months ago in a domain with 7 day retention -> unfortunately still retained.

We've been gradually finding and fixing problems that cause this, but the old data needs to be cleaned up
at some point. They can sometimes cause problems, particularly if they appear to be "running".

So this is a conservative scanner + fixer for workflows that meet these criteria.
It does not handle all states, nor does it support domains configured with >200 day retention (hardcoded but
trivially changeable, as a sanity check), but it's catching a large amount internally and seems safe to share
and run in more locations.


Since building and verifying this required figuring out the scanner/fixer code in detail,
and that proved to be a significant effort, this also includes three major additions on top
of the stale scanner:

  • there have been a variety of renames and public -> private changes to make naming consistent
    • this helped discover a significant issue with the current execution scanner, so I'm keeping it
  • concrete execution fixer invariants can now be individually enabled and disabled via dynamic config
  • a new readme.md file that attempts to summarize the learnings, and how to config and verify scanner/fixer.

@Groxx Groxx marked this pull request as ready for review July 25, 2023 22:15
@coveralls
Copy link

coveralls commented Jul 25, 2023

Pull Request Test Coverage Report for Build 01899039-7b38-4151-afff-08a098661ffd

  • 273 of 495 (55.15%) changed or added relevant lines in 8 files are covered.
  • 45 unchanged lines in 13 files lost coverage.
  • Overall coverage increased (+0.07%) to 57.115%

Changes Missing Coverage Covered Lines Changed/Added Lines %
common/reconciliation/invariant/collection_enumer.go 0 1 0.0%
service/worker/scanner/executions/current_execution.go 0 1 0.0%
service/worker/scanner/executions/types.go 0 5 0.0%
service/worker/scanner/executions/concrete_execution.go 0 7 0.0%
tools/cli/adminDBCleanCommand.go 0 8 0.0%
tools/cli/adminDBScanCommand.go 0 8 0.0%
common/reconciliation/invariant/staleWorkflow.go 266 458 58.08%
Files with Coverage Reduction New Missed Lines %
common/reconciliation/invariant/collection_enumer.go 1 0%
service/worker/scanner/executions/concrete_execution.go 1 36.36%
common/persistence/sql/sqlplugin/postgres/db.go 2 85.0%
common/task/fifoTaskScheduler.go 2 84.54%
common/task/weightedRoundRobinTaskScheduler.go 2 89.12%
common/util.go 2 53.44%
service/history/handler.go 2 47.28%
service/matching/taskListManager.go 3 80.89%
common/types/history.go 4 45.35%
service/history/historyEngine.go 4 68.79%
Totals Coverage Status
Change from base Build 01898906-4fcf-4fd0-9203-586f0c269abc: 0.07%
Covered Lines: 87519
Relevant Lines: 153234

💛 - Coveralls

@Groxx Groxx force-pushed the stale-workflow-fixer branch 6 times, most recently from 7625c8e to c15bc03 Compare September 6, 2023 02:01
service/worker/scanner/README.md Outdated Show resolved Hide resolved
service/worker/scanner/README.md Outdated Show resolved Hide resolved
service/worker/scanner/README.md Outdated Show resolved Hide resolved
service/worker/scanner/README.md Show resolved Hide resolved
Copy link
Member

@davidporter-id-au davidporter-id-au left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Overall the testing you did certainly makes me a lot more confident in the overall approach, though looked extremely annoying.

If you'd done the work already I'd suggest not doing the backwards-compat, but no strong feelings that it's there, though I would be awary introducing further complexity or working on the assumption or just set to terminate-if-running

CHANGELOG.md Show resolved Hide resolved
service/worker/scanner/executions/concrete_execution.go Outdated Show resolved Hide resolved
service/worker/scanner/README.md Show resolved Hide resolved
service/worker/scanner/README.md Show resolved Hide resolved
@Groxx Groxx merged commit 4a26eea into cadence-workflow:master Oct 3, 2023
@Groxx Groxx deleted the stale-workflow-fixer branch October 3, 2023 00:07
Groxx added a commit to Groxx/cadence that referenced this pull request Oct 27, 2023
This was accidentally broken in cadence-workflow#5361 - this timer-fixer uses the shared
fixer workflow, which now expects non-empty config about what invariants
are enabled.

Since that'll help make this clearer / less "enable by default"-prone for
future timer-invariants, I've just added a static non-empty config.
Groxx added a commit that referenced this pull request Nov 14, 2023
This was accidentally broken in #5361 - this timer-fixer uses the shared
fixer workflow, which now expects non-empty config about what invariants
are enabled.

Since that'll help make this clearer / less "enable by default"-prone for
future timer-invariants, I've just added a static non-empty config.
Groxx added a commit that referenced this pull request Nov 15, 2023
#5361 unfortunately broke timer-fixers, as the shared workflow requires a non-empty "enabled" list and none was provided in that PR.

#5433 restores that functionality.

---

Arguably `v1.2.5` could be retracted, as it introduces a potentially-problematic bug.

Since this fixer is disabled by default, we're going to skip doing that: we would have to publish `v1.2.6` to put the retraction into effect, and anyone updating that rapidly will not have much down-time.  Plus the fixer is not required for correct behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants