
[Merged by Bors] - Pause sync when EE is offline #3428


@divagant-martian divagant-martian commented Aug 4, 2022

Issue Addressed

#3032

Proposed Changes

Pause sync when the EE is offline. Changes include three main parts:

  • Online/offline notification system
  • Pause sync
  • Resume sync

Online/offline notification system

  • The engine state is now guarded behind a new struct State that ensures every change is correctly notified. Notifications are only sent if the state actually changes. The new State sits behind an RwLock (as before) as the synchronization mechanism.
  • The actual notification channel is a tokio::sync::watch, which ensures only the most recent value sits in the receiver channel, so we don't need to worry about message ordering.
  • Sync waits for state changes concurrently with normal messages (see the sketch after this list).
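
Below is a minimal sketch of this mechanism, with simplified stand-ins rather than the exact Lighthouse types (`EngineState`, `State`, and `SyncMessage` here are illustrative; the real `State` additionally sits behind an `RwLock`):

```rust
use tokio::sync::{mpsc, watch};

/// Simplified stand-in for the engine's online/offline state.
#[derive(Clone, Copy, PartialEq, Debug)]
enum EngineState {
    Online,
    Offline,
}

/// Guards the engine state so that every change is notified exactly once.
struct State {
    state: EngineState,
    notifier: watch::Sender<EngineState>,
}

impl State {
    fn new(initial: EngineState) -> (Self, watch::Receiver<EngineState>) {
        let (notifier, receiver) = watch::channel(initial);
        (Self { state: initial, notifier }, receiver)
    }

    /// Update the state, notifying watchers only if it actually changed.
    fn set(&mut self, new_state: EngineState) {
        if self.state != new_state {
            self.state = new_state;
            // An error here only means all receivers were dropped.
            let _ = self.notifier.send(new_state);
        }
    }
}

/// Hypothetical message type for sync's normal work queue.
struct SyncMessage;

/// Sync waits for engine-state changes concurrently with normal messages.
async fn sync_loop(
    mut messages: mpsc::UnboundedReceiver<SyncMessage>,
    mut engine: watch::Receiver<EngineState>,
) {
    loop {
        tokio::select! {
            Some(_msg) = messages.recv() => {
                // Handle a normal sync message.
            }
            Ok(()) = engine.changed() => {
                // `borrow` yields the latest value only; any intermediate
                // flips we never observed are irrelevant.
                match *engine.borrow() {
                    EngineState::Online => { /* resume sync */ }
                    EngineState::Offline => { /* pause sync */ }
                }
            }
        }
    }
}
```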

Pause Sync

Sync has four components, and pausing is handled differently in each:

  • Block lookups: Disabled while the EE is offline. We drop current requests and don't search for new blocks. Block lookups are infrequent, and I don't think it's worth the extra logic of keeping them around and delaying processing; if we later see that this is required, we can add it.
  • Parent lookups: Disabled while the EE is offline. We drop current requests and don't search for new parents. Parent lookups are even less frequent, and the same reasoning applies: if we later see that keeping and delaying them is required, we can add it.
  • Range: Chains don't send batches for processing to the beacon processor. This is easily done by guarding the channel to the beacon processor and giving chains access to it only while the EE is responsive (see the sketch after this list). I find this the simplest and most powerful approach: we don't need to deal with new sync states, and chain segments added while the EE is offline follow the same logic without needing to synchronize a shared state among them. Another advantage of passive pause over active pause is that we can still keep track of active advertised chain segments, so on resume we don't need to re-evaluate all our peers.
  • Backfill: Not affected by EE states; we don't pause.
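
As a sketch of the passive pause described for Range above (all names here are hypothetical, not the actual Lighthouse identifiers), the sender to the beacon processor can be hidden behind an accessor that only yields it while the EE is believed to be online:

```rust
use tokio::sync::mpsc;

/// Placeholder for a batch of blocks handed to the beacon processor.
struct WorkEvent;

/// Hypothetical slice of sync's network context.
struct SyncNetworkContext {
    beacon_processor_send: mpsc::Sender<WorkEvent>,
    execution_engine_online: bool,
}

impl SyncNetworkContext {
    /// Chains must go through this accessor to reach the beacon processor.
    /// While the EE is offline it returns `None`, so batches are simply
    /// held back without any extra sync state.
    fn processor_channel_if_enabled(&self) -> Option<&mpsc::Sender<WorkEvent>> {
        self.execution_engine_online
            .then_some(&self.beacon_processor_send)
    }
}
```

Chain segments added while the EE is offline go through the same accessor, which is why they need no special handling.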

Resume Sync

  • Block lookups: Enabled again.
  • Parent lookups: Enabled again.
  • Range: Active resume. Since the only thing pausing really does for range sync is stop batches being sent for processing, resume makes all chains that are holding ready-for-processing batches send them (sketched below).
  • Backfill: Not affected by EE states; no need to resume.
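
Continuing the sketch above (again with hypothetical names), active resume just re-enables the channel and asks every chain to flush the batches it was holding:

```rust
/// Placeholder for a chain holding ready-for-processing batches.
struct SyncingChain {
    ready_batches: Vec<WorkEvent>,
}

impl SyncingChain {
    /// On resume, push held batches through the now-accessible channel.
    fn resume(&mut self, network: &SyncNetworkContext) {
        if let Some(sender) = network.processor_channel_if_enabled() {
            for batch in self.ready_batches.drain(..) {
                // `try_send` keeps the sketch synchronous; real code would
                // handle a full queue rather than dropping the batch.
                let _ = sender.try_send(batch);
            }
        }
    }
}

/// Mark the EE as online again and resume every chain.
fn resume_range_sync(network: &mut SyncNetworkContext, chains: &mut [SyncingChain]) {
    network.execution_engine_online = true;
    for chain in chains {
        chain.resume(network);
    }
}
```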

Additional Info

QUESTION: Originally I made this notify on changes to the synced state as well, but @pawanjay176, in talks with @paulhauner, concluded that we only need to track online/offline states. The upcheck function mentions extra checks to keep a very up-to-date sync status to aid the networking stack; however, this is the only need the networking stack actually has. I added a TODO to review whether the extra check can be removed.

Next gen of #3094

Will work best with #3439

@divagant-martian divagant-martian changed the title add notifier to ee Pause sync when EE is offline Aug 8, 2022
@divagant-martian divagant-martian marked this pull request as ready for review August 10, 2022 19:32
@paulhauner paulhauner added the ready-for-review The code is ready for review label Aug 10, 2022
@pawanjay176 pawanjay176 added the under-review A reviewer has only partially completed a review. label Aug 12, 2022
@pawanjay176 pawanjay176 (Member) left a comment


Looking good. I think returning a bool in the watch channel could be confusing; I added my suggestion here: pawanjay176@070c4b9
Let me know what you think :) Just some other minor nits.

Resolved review threads (outdated):
  • beacon_node/network/src/sync/network_context.rs (2)
  • beacon_node/network/src/sync/manager.rs (2)
  • beacon_node/execution_layer/src/engines.rs (3)
impl Default for State {
    fn default() -> Self {
        let state = EngineState::default();
        let (notifier, _receiver) = watch::channel(state.is_online());
A Member commented on this code:

If the default state is Offline, wouldn't that lead to the sync manager reading the initial state as Offline and pausing all sync, given that the default state in sync is Online? Maybe we can start as Online and change to Offline if the initial upcheck fails? I might be misunderstanding tokio::watch here, so please correct me if I'm wrong.

@divagant-martian (Collaborator, Author) replied:

What you said is right; however, this is what I expect. Since the default state in sync is Online, we start normally. If there isn't an execution engine, that state will never change. If there is one, then the right behaviour is to pause at the beginning. It seems unintuitive, but I think it's right.
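
For reference, the tokio::sync::watch behaviour under discussion can be verified in a few lines; this uses the real tokio API, with a `bool` payload mirroring the PR's original channel:

```rust
#[tokio::main]
async fn main() {
    // A watch channel always holds exactly one value; `false` = offline here.
    let (tx, mut rx) = tokio::sync::watch::channel(false);
    assert!(!*rx.borrow());

    // The first successful upcheck flips the value; receivers wake on change
    // and always read the latest value.
    tx.send(true).unwrap();
    rx.changed().await.unwrap();
    assert!(*rx.borrow());
}
```

So a receiver created while the state is Offline reports Offline until the first successful upcheck sends Online, which is exactly the pause-then-resume behaviour described above.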

@pawanjay176 pawanjay176 added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed ready-for-review The code is ready for review labels Aug 13, 2022
@divagant-martian divagant-martian added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits their implementation. labels Aug 13, 2022
@divagant-martian (Collaborator, Author)

Thanks for the review @pawanjay176 :) All comments except two are addressed; I'm leaving those open to wait for more feedback. Also, do you think you could help me test this?

@pawanjay176 (Member)

I'll be testing this locally over the weekend :)

Also, @paulhauner mentioned that we will be provisioning a new BN <-> EE pair in our infra, without any connected validators, sometime next week. We could test this on those boxes by scheduling a cron job that stops the EE service and restarts it after a couple of hours.

@pawanjay176 pawanjay176 (Member) left a comment


Awesome work! I have tested this on Sepolia by shutting down my execution node (Geth) for multiple intervals (1 minute, 1 epoch, 1 hour, and a couple of hours), with and without backfill sync, and it stops and recovers as expected every time 🎉

@paulhauner paulhauner (Member) left a comment


Looks great! I'm not an expert in the sync code; however, the execution_layer and other components I'm familiar with look good. It also appears to me that the changes are nicely isolated to the sync functions and not leaking "upstream" into the BeaconProcessor or similar ☺️

@paulhauner (Member)

> Also, @paulhauner mentioned that we will be provisioning a new BN <-> EE pair in our infra, without any connected validators, sometime next week. We could test this on those boxes by scheduling a cron job that stops the EE service and restarts it after a couple of hours.

I haven't had a chance to do this, I'm sorry. I've been waiting on a few servers that were only just provisioned by our cloud provider over the weekend and now we're in merge mode. If anyone wouldn't feel comfortable merging without those tests then I can certainly make it happen!

@paulhauner paulhauner added the bellatrix Required to support the Bellatrix Upgrade label Aug 22, 2022

pawanjay176 commented Aug 22, 2022

I think I'm comfortable merging this. Have done sufficient local testing.

@divagant-martian divagant-martian added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Aug 24, 2022
@michaelsproul (Member)

bors r+

bors bot pushed a commit that referenced this pull request Aug 24, 2022
Co-authored-by: Pawan Dhananjay <[email protected]>
@bors bors bot changed the title Pause sync when EE is offline [Merged by Bors] - Pause sync when EE is offline Aug 25, 2022
@bors bors bot closed this Aug 25, 2022
Woodpile37 pushed a commit to Woodpile37/lighthouse that referenced this pull request Jan 6, 2024
Labels
  • bellatrix (Required to support the Bellatrix Upgrade)
  • Networking
  • ready-for-merge (This PR is ready to merge.)
  • v3.1.0