Backport of fsm: fix bug in snapshot restore for removed timetable into release/1.9.x #24417

Merged


hc-github-team-nomad-core

Backport

This PR is auto-generated from #24412 to be assessed for backporting due to the inclusion of the label backport/1.9.x.

The below text is copied from the body of the original PR.


When we removed the time table in #24112 we introduced a bug where if a previous version of Nomad had written a time table entry, we'd return from the restore loop early and never load the rest of the FSM. This will result in a mostly or partially wiped state for that Nomad node, which would then be out of sync with its peers (which would also have the same problem on upgrade).

The bug only occurs when the FSM is restored from a snapshot, so it does not show up if you test with a server that has only written Raft logs and never snapshotted them.

While fixing this bug, we still need to ensure we're reading the time table entries even if we're throwing them away, so that we move the snapshot reader along to the next full entry.

Fixes: #24411


To minimally test:

  • Run Nomad 1.9.1 or earlier (a single node cluster is ok, but not -dev mode).
  • Run `nomad operator snapshot save /tmp/snapshot.tar.gz`
  • Run `nomad operator root keyring list` to see the current keyring (a good stand-in for checking that the snapshot has been restored)
  • Stop the agent
  • Start the agent with this branch.
  • Verify that you do not see the "initializing keyring" log line.
  • Run `nomad operator root keyring list` again to see that the same key has been restored.

We should do some more extensive upgrade testing as well before merging this.



@pkazmierczak pkazmierczak merged commit 8107454 into release/1.9.x Nov 11, 2024
20 checks passed
@pkazmierczak pkazmierczak deleted the backport/timetable-fix/gradually-gentle-gannet branch November 11, 2024 13:28