
fsm: fix bug in snapshot restore for removed timetable #24412

Merged · 1 commit · Nov 11, 2024
Conversation

@tgross (Member) commented Nov 9, 2024

When we removed the time table in #24112 we introduced a bug where if a previous version of Nomad had written a time table entry, we'd return from the restore loop early and never load the rest of the FSM. This will result in a mostly or partially wiped state for that Nomad node, which would then be out of sync with its peers (which would also have the same problem on upgrade).

The bug only occurs when the FSM is being restored from snapshot, which isn't the case if you test with a server that's only written Raft logs and not snapshotted them.

While fixing this bug, we still need to ensure we're reading the time table entries even if we're throwing them away, so that we move the snapshot reader along to the next full entry.
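
For illustration, a minimal sketch of the shape of the change is below. The decoder setup, the `timeTableEntry` layout, and the handler signature are simplified assumptions rather than the verbatim Nomad code; the point is that the `TimeTableSnapshot` case decodes and discards the legacy entries instead of returning early.

```go
package fsm

import (
	"io"
	"time"

	"github.com/hashicorp/go-msgpack/codec"
)

// SnapshotType tags each record in the snapshot stream. The constant value
// used here for the legacy time table record is an assumption.
type SnapshotType byte

const TimeTableSnapshot SnapshotType = 1

// timeTableEntry approximates what the removed time table persisted; the
// exact field layout is assumed for illustration.
type timeTableEntry struct {
	Index uint64
	Time  time.Time
}

// restoreSketch reads records until EOF. The legacy time table record is
// decoded and thrown away so the reader advances to the next full entry,
// rather than returning early and abandoning the rest of the FSM state.
func restoreSketch(r io.Reader, handle func(SnapshotType, *codec.Decoder) error) error {
	dec := codec.NewDecoder(r, &codec.MsgpackHandle{})
	msgType := make([]byte, 1)
	for {
		if _, err := r.Read(msgType); err == io.EOF {
			return nil
		} else if err != nil {
			return err
		}

		switch snapType := SnapshotType(msgType[0]); snapType {
		case TimeTableSnapshot:
			// COMPAT: the time table was removed in #24112, but snapshots
			// written by older servers still contain its entries. Consume
			// them and discard the data; do NOT return here.
			var entries []timeTableEntry
			if err := dec.Decode(&entries); err != nil {
				return err
			}
		default:
			// Every other record restores state as before.
			if err := handle(snapType, dec); err != nil {
				return err
			}
		}
	}
}
```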

Fixes: #24411


To minimally test:

  • Run Nomad 1.9.1 or earlier (a single-node cluster is ok, but not -dev mode).
  • Run `nomad operator snapshot save /tmp/snapshot.tar.gz`.
  • Run `nomad operator root keyring list` to see the current keyring (a good stand-in to confirm later that the snapshot has been restored).
  • Stop the agent.
  • Start the agent with this branch.
  • Verify that you do not see the "initializing keyring" log line.
  • Run `nomad operator root keyring list` again to see that the same key has been restored.

We should do some more extensive upgrade testing as well before merging this.

@tgross added this to the 1.9.x milestone Nov 9, 2024
@tgross added the backport/1.9.x (backport to 1.9.x release line) label Nov 9, 2024
@jrasell (Member) left a comment

Thanks for looking into this over the weekend @tgross! I am approving based on what my testing has shown and the code change, but I would suggest that merging wait until someone more familiar with the change has also tested and reviewed.

The tests used a single server agent (not in dev mode) locally on macOS, which was started and had its ACL system bootstrapped. I then used the following script, along with the corresponding ACL policy file, to add some objects:

setup.sh
#!/usr/bin/env bash

set -o errexit

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

# Create test namespaces.
nomad namespace apply -description "test namespace" test-1
nomad namespace apply -description "test namespace" test-2
nomad namespace apply -description "test namespace" test-3
nomad namespace apply -description "test namespace" test-4
nomad namespace apply -description "test namespace" test-5
nomad namespace apply -description "test namespace" test-6
nomad namespace apply -description "test namespace" test-7
nomad namespace apply -description "test namespace" test-8
nomad namespace apply -description "test namespace" test-9

# Create test ACL policies.
nomad acl policy apply -description "test acl policy" test-1 "$SCRIPT_DIR"/_test_acl_policy.hcl
nomad acl policy apply -description "test acl policy" test-2 "$SCRIPT_DIR"/_test_acl_policy.hcl
nomad acl policy apply -description "test acl policy" test-3 "$SCRIPT_DIR"/_test_acl_policy.hcl
nomad acl policy apply -description "test acl policy" test-4 "$SCRIPT_DIR"/_test_acl_policy.hcl
nomad acl policy apply -description "test acl policy" test-5 "$SCRIPT_DIR"/_test_acl_policy.hcl
nomad acl policy apply -description "test acl policy" test-6 "$SCRIPT_DIR"/_test_acl_policy.hcl
nomad acl policy apply -description "test acl policy" test-7 "$SCRIPT_DIR"/_test_acl_policy.hcl
nomad acl policy apply -description "test acl policy" test-8 "$SCRIPT_DIR"/_test_acl_policy.hcl
nomad acl policy apply -description "test acl policy" test-9 "$SCRIPT_DIR"/_test_acl_policy.hcl

# Create client ACL tokens.
nomad acl token create -name="test client acl token" -policy=test-1 -type=client
nomad acl token create -name="test client acl token" -policy=test-2 -type=client
nomad acl token create -name="test client acl token" -policy=test-3 -type=client
nomad acl token create -name="test client acl token" -policy=test-4 -type=client
nomad acl token create -name="test client acl token" -policy=test-5 -type=client
nomad acl token create -name="test client acl token" -policy=test-6 -type=client
nomad acl token create -name="test client acl token" -policy=test-7 -type=client
nomad acl token create -name="test client acl token" -policy=test-8 -type=client
nomad acl token create -name="test client acl token" -policy=test-9 -type=client

# Create management ACL tokens.
nomad acl token create -name="test management acl token" -type=management
nomad acl token create -name="test management acl token" -type=management
nomad acl token create -name="test management acl token" -type=management
nomad acl token create -name="test management acl token" -type=management
nomad acl token create -name="test management acl token" -type=management
nomad acl token create -name="test management acl token" -type=management
nomad acl token create -name="test management acl token" -type=management
nomad acl token create -name="test management acl token" -type=management
nomad acl token create -name="test management acl token" -type=management
_test_acl_policy.hcl
namespace "*" {
  policy = "write"
}

agent {
  policy = "read"
}

node {
  policy = "read"
}

quota {
  policy = "read"
}

Once the script had run successfully, I ran `nomad operator snapshot save` to take a snapshot of the server state. The server's data directory was wiped clean between tests.

Nomad v1.9.1 to v1.9.2

Once the Nomad server was upgraded, no state was available; the ACL system even required bootstrapping again.

Nomad v1.9.1 to timetable-fix (a67fc1f)

Once the Nomad server was upgraded, all state was available and confirmed via the CLI listing each object set created using the script.

Manual Snapshot Restore v1.9.2

I took a fresh Nomad server and restored a previously taken snapshot; no state was restored.

Manual Snapshot Restore timetable-fix (a67fc1f)

I took a fresh Nomad server and restored a previously taken snapshot which populated all the expected state and was confirmed via the CLI listing each object set created using the script.

@pkazmierczak (Contributor) left a comment

I did similar testing to @jrasell's, covering upgrades from 1.7.7 and 1.8.4 to a67fc1f. State was successfully restored.

Reading through #24112, the only place that touches the snapshot restore code path is indeed the restoreImpl method in the FSM. The core scheduler changes only apply to GC methods, and the same goes for BlockedEvals.

@tgross (Member, Author) commented Nov 11, 2024

It occurs to me that the EventSinkSnapshot case has the exact same bug. That type didn't ship in GA code, so it only exists in snapshots taken on 1.0-beta.1 and never on 1.0 GA; we should never see it on any real cluster. We should be returning an error there instead of nil, because the only way to hit that case is corruption. But given that this bug has been lurking for four years, I'm not going to touch it in this emergency patch. I'll follow up with a PR for that later.
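
A follow-up along those lines might look roughly like the sketch below; the helper name and the error text are illustrative only, and the real change would live in the same restore switch as the time table case.

```go
package fsm

import "errors"

// Sketch of the suggested follow-up for the legacy event sink record: event
// sinks never shipped in a GA release, so seeing this record on a real
// cluster can only mean a corrupt (or unsupported pre-GA) snapshot. Restore
// should fail loudly here instead of silently returning nil.
func restoreEventSinkSketch() error {
	return errors.New("unexpected EventSinkSnapshot entry: snapshot is corrupt or was written by an unsupported pre-GA build")
}
```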

Labels: backport/1.9.x (backport to 1.9.x release line), theme/raft, type/bug

Merging this pull request may close: 1.9.2 upgrade results in wiped state