-
Notifications
You must be signed in to change notification settings - Fork 265
Improve update and savepoint handling #420
base: master
Are you sure you want to change the base?
Conversation
Thanks for the PRs! Let me know when it is ready for review. |
@elanv Thank you for doing this! you are awesome :) Also, I see that in the takeSavepointOnUpgrade flow I added (nice rename BTW) it takes the SP synchronously. And I saw that while it happends the operator fails to submit new jobclsuters (it just waits for the SP to finish then it submits them) Do you think its something we can change as well? or it proves to difficult ATM? last thing, do you thinks its possible to leverage the new SavepointMaxAgeToRestoreSeconds (or a new field) to make the job submitter fail if not SP is provided for it? (To reduce the chance for an error in deployment where a job that must start from a SP does not) |
I plan to write additional unit test code while doing enough tests. The diagram of the new job states applied to new commit is also attached to the PR description. |
@shashken Sorry for late response. The savepoint issue is specific to the auto savepoint feature. When auto savepoint is enabled, as soon as job starts the operator triggers a savepoint, but the savepoint fails if some tasks of the Flink job did not started. In that case, the savepoint status is updated as failed, and immediatley the next iteration of the reconciliation loop starts again because the status change triggers it. But in the following iterations, it is likely that the previous failing routine is still repeated because there wasn't enough time between the iterations. Basically the savepoint should not be triggered as soon as job starts, therefore, I have fixed it in this PR to trigger the first savepoint after provided time interval passed. I do not understand still in which case
|
@elanv Regarding the
|
- fix handling failed auto savepoint - fix validations and tests related to changes - improve update routine - change the behavior of handling unexpected jobs - add a constraint for update: when takeSavepointOnUpdate is true, latest savepoint age should be less than maxStateAgeToRestore
@functicons Could you review this PR? I think completed this work almost. Now, it would be nice to add the documentation and tests while proceeding with the review. |
@shashken I have assumed that the job must be resumed from its last state when update is triggered . So I thought And in the last commit, I applied |
Purpose of this PR
Currently, savepoint and its related routines are scattered in several places. It make difficult to enhance this operator now. This PR organizes them so that savepoint-related routines can be improved and extended in the future. It also improves the update, cancel and recovery features that depend on savepoint routines.
Changes
Details
Organize and fix savepoint routine
Add field to configure the latest savepoint age for updateSavepointMaxAgeForUpdateSeconds
lastSavepointTriggerTime
andlastSavepointTriggerID
: duplicated withstatus.savepoint
status.job.startTime
and deleteSavepointTriggerReasonScheduledInitial
Change job stop behavior when updating and cancelling a job
Improve update process
takeSavepointOnUpdate
fieldtakeSavepointOnUpgrade
totakeSavepointOnUpdate
takeSavepointOnUpdate
Elaborate update/restart strategy
Limit job state age from which job can be restarted when auto restarting from failure, updating stopped job and updating running job wiith
takeSavepointOnUpdate
falseMaxStateAgeToRestoreSeconds
Add new job deployment states