Backport of fsm: fix bug in snapshot restore for removed timetable into release/1.9.x #24417
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Backport
This PR is auto-generated from #24412 to be assessed for backporting due to the inclusion of the label backport/1.9.x.
The below text is copied from the body of the original PR.
When we removed the time table in #24112 we introduced a bug where if a previous version of Nomad had written a time table entry, we'd return from the restore loop early and never load the rest of the FSM. This will result in a mostly or partially wiped state for that Nomad node, which would then be out of sync with its peers (which would also have the same problem on upgrade).
The bug only occurs when the FSM is being restored from snapshot, which isn't the case if you test with a server that's only written Raft logs and not snapshotted them.
While fixing this bug, we still need to ensure we're reading the time table entries even if we're throwing them away, so that we move the snapshot reader along to the next full entry.
Fixes: #24411
To minimally test:
-dev
mode).nomad operator root keyring list
to see the current keyring (a good standin to see that the snapshot has been restored)nomad operator root keyring list
again to see that the same key has been restored.We should do some more extensive upgrade testing as well before merging this.
Overview of commits