Add a "stale workflow" scanner/fixer impl, to remove data beyond reasonable retention #5361
…r manager, it was using concrete?!
LGTM.
Overall, the testing you did certainly makes me a lot more confident in the approach, though it looked extremely annoying.
If you hadn't done the work already I'd suggest not doing the backwards-compat, but no strong feelings now that it's there, though I would be wary of introducing further complexity or working on the assumption or just set to terminate-if-running
This was accidentally broken in cadence-workflow#5361 - this timer-fixer uses the shared fixer workflow, which now expects non-empty config about what invariants are enabled. Since that'll help make this clearer / less "enable by default"-prone for future timer-invariants, I've just added a static non-empty config.
#5361 unfortunately broke timer-fixers, as the shared workflow requires a non-empty "enabled" list and none was provided in that PR. #5433 restores that functionality.

---

Arguably `v1.2.5` could be retracted, as it introduces a potentially-problematic bug. Since this fixer is disabled by default, we're going to skip doing that: we would have to publish `v1.2.6` to put the retraction into effect, and anyone updating that rapidly will not have much down-time. Plus the fixer is not required for correct behavior.
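As a rough illustration of the failure mode described above, here is a minimal sketch of a config check that rejects an empty "enabled" list instead of treating it as "run every invariant". The `fixerConfig` type, its `EnabledInvariants` field, and the `"timer-invalid"` name are all hypothetical stand-ins, not the actual types or invariant names in the repository.

```go
package main

import (
	"errors"
	"fmt"
)

// fixerConfig is a hypothetical stand-in for the shared fixer workflow's
// config; the real type and field names may differ.
type fixerConfig struct {
	EnabledInvariants []string
}

var errNoInvariants = errors.New("fixer config must enable at least one invariant")

// validate rejects an empty "enabled" list, so that a caller which forgets
// to configure invariants fails loudly instead of silently enabling all.
func (c fixerConfig) validate() error {
	if len(c.EnabledInvariants) == 0 {
		return errNoInvariants
	}
	return nil
}

func main() {
	// A caller that passes no config hits the validation error.
	fmt.Println(fixerConfig{}.validate())

	// A static, non-empty config (the invariant name is made up) passes.
	static := fixerConfig{EnabledInvariants: []string{"timer-invalid"}}
	fmt.Println(static.validate())
}
```

Requiring an explicit list means any future invariant has to be opted into by each fixer, which is the "less enable-by-default-prone" behavior the commit message mentions.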
For a variety of reasons, we have workflows sitting in some of our databases far beyond any reasonable time range.
E.g. a workflow completed months ago in a domain with 7 day retention -> unfortunately still retained.
We've been gradually finding and fixing the problems that cause this, but the old data still needs to be cleaned up
at some point. These leftover workflows can sometimes cause problems, particularly if they appear to be "running".
So this is a conservative scanner + fixer for workflows that meet these criteria.
It does not handle all states, nor does it support domains configured with >200 day retention (a hardcoded but
trivially changeable sanity check), but it's catching a large number of stale workflows internally and seems safe to share
and run in more locations.
Since building and verifying this required figuring out the scanner/fixer code in detail,
and that proved to be a significant effort, this also includes three major additions on top
of the stale scanner: