fsm: fix bug in snapshot restore for removed timetable #24412
Conversation
When we removed the time table in #24112 we introduced a bug where, if a previous version of Nomad had written a time table entry, we'd return from the restore loop early and never load the rest of the FSM. This will result in a mostly or partially wiped state for that Nomad node, which would then be out of sync with its peers (which would also have the same problem on upgrade).

The bug only occurs when the FSM is being restored from snapshot, which isn't the case if you test with a server that's only written Raft logs and not snapshotted them.

While fixing this bug, we still need to ensure we're reading the time table entries even if we're throwing them away, so that we move the snapshot reader along to the next full entry.

Fixes: #24411
Thanks for looking into this over the weekend @tgross! I am approving based on what my testing has shown and the code change. I would suggest merging waits until someone more familiar with the change has also tested and reviewed.
The tests used a single server agent (not in dev mode) running locally on macOS, which was started and had its ACL system bootstrapped. I then used the following script to add some objects with the corresponding ACL policy file:
setup.sh
#!/usr/bin/env bash
set -o errexit

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

# Create test namespaces.
for i in $(seq 1 9); do
  nomad namespace apply -description "test namespace" "test-${i}"
done

# Create test ACL policies.
for i in $(seq 1 9); do
  nomad acl policy apply -description "test acl policy" "test-${i}" "$SCRIPT_DIR"/_test_acl_policy.hcl
done

# Create client ACL tokens.
for i in $(seq 1 9); do
  nomad acl token create -name="test client acl token" -policy="test-${i}" -type=client
done

# Create management ACL tokens.
for _ in $(seq 1 9); do
  nomad acl token create -name="test management acl token" -type=management
done
_test_acl_policy.hcl
namespace "*" {
  policy = "write"
}

agent {
  policy = "read"
}

node {
  policy = "read"
}

quota {
  policy = "read"
}
Once the script had been run successfully, I ran nomad operator snapshot save to perform a snapshot of the server state. The data directory of the server was wiped clean between the tests.
Nomad v1.9.1 to v1.9.2
Once the Nomad server was upgraded, no state was available, to the point that the ACL system required bootstrapping again.
Nomad v1.9.1 to timetable-fix (a67fc1f)
Once the Nomad server was upgraded, all state was available, confirmed via the CLI by listing each object set created using the script.
Manual Snapshot Restore v1.9.2
I took a fresh Nomad server and restored a previously taken snapshot which did not result in state restoration.
Manual Snapshot Restore timetable-fix (a67fc1f)
I took a fresh Nomad server and restored a previously taken snapshot, which populated all the expected state; this was confirmed via the CLI by listing each object set created using the script.
I did similar testing to @jrasell, testing upgrades from 1.7.7 and 1.8.4 to a67fc1f. State was successfully restored.
Reading through #24112, the only place that touches the snapshot restore code path is indeed the restoreImpl method in the FSM. The core scheduler changes only apply to GC methods, same as BlockedEvals.
To minimally test:
1. Start a server (not in -dev mode).
2. Run nomad operator root keyring list to see the current keyring (a good stand-in to see that the snapshot has been restored).
3. Take a snapshot with nomad operator snapshot save, wipe the server's data directory, and restore the snapshot on a fresh server.
4. Run nomad operator root keyring list again to see that the same key has been restored.

We should do some more extensive upgrade testing as well before merging this.