
fsm: fix bug in snapshot restore for removed timetable #24412

Merged · 1 commit · Nov 11, 2024
Conversation

@tgross (Member) commented Nov 9, 2024

When we removed the time table in #24112 we introduced a bug where if a previous version of Nomad had written a time table entry, we'd return from the restore loop early and never load the rest of the FSM. This will result in a mostly or partially wiped state for that Nomad node, which would then be out of sync with its peers (which would also have the same problem on upgrade).

The bug only occurs when the FSM is being restored from snapshot, which isn't the case if you test with a server that's only written Raft logs and not snapshotted them.

While fixing this bug, we still need to ensure we're reading the time table entries even if we're throwing them away, so that we move the snapshot reader along to the next full entry.
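
For illustration, a minimal sketch of the shape of the change is below. The decoder setup, the `timeTableEntry` layout, and the handler signature are simplified assumptions rather than the verbatim Nomad code; the point is that the `TimeTableSnapshot` case decodes and discards the legacy entries instead of returning early.

```go
package fsm

import (
	"io"
	"time"

	"github.com/hashicorp/go-msgpack/codec"
)

// SnapshotType tags each record in the snapshot stream. The constant value
// used here for the legacy time table record is an assumption.
type SnapshotType byte

const TimeTableSnapshot SnapshotType = 1

// timeTableEntry approximates what the removed time table persisted; the
// exact field layout is assumed for illustration.
type timeTableEntry struct {
	Index uint64
	Time  time.Time
}

// restoreSketch reads records until EOF. The legacy time table record is
// decoded and thrown away so the reader advances to the next full entry,
// rather than returning early and abandoning the rest of the FSM state.
func restoreSketch(r io.Reader, handle func(SnapshotType, *codec.Decoder) error) error {
	dec := codec.NewDecoder(r, &codec.MsgpackHandle{})
	msgType := make([]byte, 1)
	for {
		if _, err := r.Read(msgType); err == io.EOF {
			return nil
		} else if err != nil {
			return err
		}

		switch snapType := SnapshotType(msgType[0]); snapType {
		case TimeTableSnapshot:
			// COMPAT: the time table was removed in #24112, but snapshots
			// written by older servers still contain its entries. Consume
			// them and discard the data; do NOT return here.
			var entries []timeTableEntry
			if err := dec.Decode(&entries); err != nil {
				return err
			}
		default:
			// Every other record restores state as before.
			if err := handle(snapType, dec); err != nil {
				return err
			}
		}
	}
}
```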

Fixes: #24411


To minimally test:

  • Run Nomad 1.9.1 or earlier (a single-node cluster is ok, but not -dev mode).
  • Run `nomad operator snapshot save /tmp/snapshot.tar.gz`.
  • Run `nomad operator root keyring list` to see the current keyring (a good stand-in to confirm later that the snapshot has been restored).
  • Stop the agent.
  • Start the agent with this branch.
  • Verify that you do not see the "initializing keyring" log line.
  • Run `nomad operator root keyring list` again to see that the same key has been restored.

We should do some more extensive upgrade testing as well before merging this.

@tgross added this to the 1.9.x milestone Nov 9, 2024
@tgross added the backport/1.9.x (backport to 1.9.x release line) label Nov 9, 2024
@jrasell (Member) left a comment

Thanks for looking into this over the weekend @tgross! I am approving based on what my testing has shown and the code change, but I would suggest that merging wait until someone more familiar with the change has also tested and reviewed.

The tests used a single server agent (not in dev mode) locally on macOS, which was started and had its ACL system bootstrapped. I then used the following script, along with the corresponding ACL policy file, to add some objects:

setup.sh
#!/usr/bin/env bash

set -o errexit

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

# Create test namespaces.
nomad namespace apply -description "test namespace" test-1
nomad namespace apply -description "test namespace" test-2
nomad namespace apply -description "test namespace" test-3
nomad namespace apply -description "test namespace" test-4
nomad namespace apply -description "test namespace" test-5
nomad namespace apply -description "test namespace" test-6
nomad namespace apply -description "test namespace" test-7
nomad namespace apply -description "test namespace" test-8
nomad namespace apply -description "test namespace" test-9

# Create test ACL policies.
nomad acl policy apply -description "test acl policy" test-1 "$SCRIPT_DIR"/_test_acl_policy.hcl
nomad acl policy apply -description "test acl policy" test-2 "$SCRIPT_DIR"/_test_acl_policy.hcl
nomad acl policy apply -description "test acl policy" test-3 "$SCRIPT_DIR"/_test_acl_policy.hcl
nomad acl policy apply -description "test acl policy" test-4 "$SCRIPT_DIR"/_test_acl_policy.hcl
nomad acl policy apply -description "test acl policy" test-5 "$SCRIPT_DIR"/_test_acl_policy.hcl
nomad acl policy apply -description "test acl policy" test-6 "$SCRIPT_DIR"/_test_acl_policy.hcl
nomad acl policy apply -description "test acl policy" test-7 "$SCRIPT_DIR"/_test_acl_policy.hcl
nomad acl policy apply -description "test acl policy" test-8 "$SCRIPT_DIR"/_test_acl_policy.hcl
nomad acl policy apply -description "test acl policy" test-9 "$SCRIPT_DIR"/_test_acl_policy.hcl

# Create client ACL tokens.
nomad acl token create -name="test client acl token" -policy=test-1 -type=client
nomad acl token create -name="test client acl token" -policy=test-2 -type=client
nomad acl token create -name="test client acl token" -policy=test-3 -type=client
nomad acl token create -name="test client acl token" -policy=test-4 -type=client
nomad acl token create -name="test client acl token" -policy=test-5 -type=client
nomad acl token create -name="test client acl token" -policy=test-6 -type=client
nomad acl token create -name="test client acl token" -policy=test-7 -type=client
nomad acl token create -name="test client acl token" -policy=test-8 -type=client
nomad acl token create -name="test client acl token" -policy=test-9 -type=client

# Create management ACL tokens.
nomad acl token create -name="test management acl token" -type=management
nomad acl token create -name="test management acl token" -type=management
nomad acl token create -name="test management acl token" -type=management
nomad acl token create -name="test management acl token" -type=management
nomad acl token create -name="test management acl token" -type=management
nomad acl token create -name="test management acl token" -type=management
nomad acl token create -name="test management acl token" -type=management
nomad acl token create -name="test management acl token" -type=management
nomad acl token create -name="test management acl token" -type=management
_test_acl_policy.hcl
namespace "*" {
  policy = "write"
}

agent {
  policy = "read"
}

node {
  policy = "read"
}

quota {
  policy = "read"
}

Once the script had run successfully, I ran `nomad operator snapshot save` to take a snapshot of the server state. The server's data directory was wiped clean between tests.

Nomad v1.9.1 to v1.9.2

Once the Nomad server was upgraded, no state was available; the ACL system even required bootstrapping again.

Nomad v1.9.1 to timetable-fix (a67fc1f)

Once the Nomad server was upgraded, all state was available and confirmed via the CLI listing each object set created using the script.

Manual Snapshot Restore v1.9.2

I took a fresh Nomad server and restored a previously taken snapshot; no state was restored.

Manual Snapshot Restore timetable-fix (a67fc1f)

I took a fresh Nomad server and restored a previously taken snapshot which populated all the expected state and was confirmed via the CLI listing each object set created using the script.

@pkazmierczak (Contributor) left a comment

I did similar testing to @jrasell's, covering upgrades from 1.7.7 and 1.8.4 to a67fc1f. State was successfully restored.

Reading through #24112, the only place that touches the snapshot restore code path is indeed the restoreImpl method in the FSM. The core scheduler changes only apply to GC methods, and the same goes for BlockedEvals.

@tgross (Member, Author) commented Nov 11, 2024

It occurs to me that the EventSinkSnapshot case has the exact same bug. That type didn't ship in GA code, so it only exists in snapshots taken on 1.0-beta.1 and never on 1.0 GA; we should never see it on any real cluster. We should be returning an error there instead of nil, because the only way to hit that case is corruption. But given that this bug has been lurking for four years, I'm not going to touch it in this emergency patch. I'll follow up with a PR for that later.
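
A follow-up along those lines might look roughly like the sketch below; the helper name and the error text are illustrative only, and the real change would live in the same restore switch as the time table case.

```go
package fsm

import "errors"

// Sketch of the suggested follow-up for the legacy event sink record: event
// sinks never shipped in a GA release, so seeing this record on a real
// cluster can only mean a corrupt (or unsupported pre-GA) snapshot. Restore
// should fail loudly here instead of silently returning nil.
func restoreEventSinkSketch() error {
	return errors.New("unexpected EventSinkSnapshot entry: snapshot is corrupt or was written by an unsupported pre-GA build")
}
```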

Labels: backport/1.9.x (backport to 1.9.x release line), theme/raft, type/bug

Merging this pull request may close: 1.9.2 upgrade results in wiped state