[Merged by Bors] - Pause sync when EE is offline #3428
…e execution layer is not synced
Looking good. I think returning a `bool` in the watcher channel could be potentially confusing. I added my suggestion here pawanjay176@070c4b9
Let me know what you think :) Just some other minor nits.
impl Default for State {
    fn default() -> Self {
        let state = EngineState::default();
        let (notifier, _receiver) = watch::channel(state.is_online());
If the default state is `Offline`, wouldn't that lead to the sync manager reading the initial sync state as `Offline` and pausing all sync, given that the default state in sync is `Online`? Maybe we can start as `Online` and change to `Offline` if the initial upcheck fails? I might be misunderstanding `tokio::watch` here so please correct me if I'm wrong.
What you said is right, however this is what I expect. Since the default state in sync is online, we start normally. If there isn't an execution engine, that state will never be changed. If there is, then the right behaviour is to pause at the beginning. It seems not intuitive but I think it's right
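The notify-only-on-change behaviour discussed in this thread can be illustrated with a minimal sketch. It is not the code from the PR: the two-variant `EngineState`, the `update` method and the `State::new` constructor are hypothetical simplifications, and `std::sync::mpsc` stands in for `tokio::sync::watch` so the example runs with the standard library alone. Unlike a watch channel, `mpsc` queues every message instead of keeping only the latest one.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// Hypothetical, simplified engine state; the real enum has more variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Default)]
enum EngineState {
    Online,
    #[default]
    Offline,
}

impl EngineState {
    fn is_online(self) -> bool {
        matches!(self, EngineState::Online)
    }
}

// Wrapper that owns the state and notifies listeners only on transitions.
struct State {
    state: EngineState,
    notifier: Sender<bool>,
}

impl State {
    // Starts in the default (offline) state without sending anything.
    fn new() -> (Self, Receiver<bool>) {
        let (notifier, receiver) = channel();
        let state = EngineState::default();
        (State { state, notifier }, receiver)
    }

    // Stores the new state and notifies only if online/offline flipped.
    fn update(&mut self, new_state: EngineState) {
        let was_online = self.state.is_online();
        self.state = new_state;
        if was_online != new_state.is_online() {
            // A dropped receiver is not an error for the engine.
            let _ = self.notifier.send(new_state.is_online());
        }
    }
}

fn main() {
    let (mut state, receiver) = State::new();
    state.update(EngineState::Offline); // same as default: no notification
    state.update(EngineState::Online); // transition: notifies `true`
    state.update(EngineState::Online); // unchanged: no notification
    state.update(EngineState::Offline); // transition: notifies `false`
    let seen: Vec<bool> = receiver.try_iter().collect();
    println!("{:?}", seen); // prints "[true, false]"
}
```

In this sketch a listener that assumes `Online` at startup hears nothing until the engine state actually flips, which mirrors the point made above that the state in sync is never changed if no transition happens.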
Thanks for the review @pawanjay176 :) all comments except two addressed. I'm leaving those open to wait for more feedback. Also do you think you could help me testing this?
I'll be testing this locally over the weekend :) also @paulhauner mentioned that we will be provisioning a new BN <-> EE pair in our infra without any connected validators sometime next week. We could test this on those boxes by scheduling a cron job to stop the EE service and restart it after a couple of hours to test this out.
Awesome work! I have tested this on sepolia by shutting down my execution node (geth) for multiple intervals (1 minute, 1 epoch, 1 hour and a couple of hours), with and without backfill sync, and it stops and recovers as expected every time 🎉
Looks great! I'm not an expert in the sync code, however the `execution_layer` and other components I'm familiar with look good. It also appears to me that the changes are nicely isolated to the sync functions and not leaking "upstream" into the `BeaconProcessor` or similar.
I haven't had a chance to do this, I'm sorry. I've been waiting on a few servers that were only just provisioned by our cloud provider over the weekend and now we're in merge mode. If anyone wouldn't feel comfortable merging without those tests then I can certainly make it happen!
I think I'm comfortable merging this. Have done sufficient local testing.
bors r+
## Issue Addressed

#3032

## Proposed Changes

Pause sync when ee is offline. Changes include three main parts:
- Online/offline notification system
- Pause sync
- Resume sync

#### Online/offline notification system
- The engine state is now guarded behind a new struct `State` that ensures every change is correctly notified. Notifications are only sent if the state changes. The new `State` is behind a `RwLock` (as before) as the synchronization mechanism.
- The actual notification channel is a [tokio::sync::watch](https://docs.rs/tokio/latest/tokio/sync/watch/index.html) which ensures only the last value is in the receiver channel. This way we don't need to worry about message order etc.
- Sync waits for state changes concurrently with normal messages.

#### Pause Sync
Sync has four components, pausing is done differently in each:
- **Block lookups**: Disabled while in this state. We drop current requests and don't search for new blocks. Block lookups are infrequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Parent lookups**: Disabled while in this state. We drop current requests and don't search for new parents. Parent lookups are even less frequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Range**: Chains don't send batches for processing to the beacon processor. This is easily done by guarding the channel to the beacon processor and giving it access only if the ee is responsive. I find this the simplest and most powerful approach since we don't need to deal with new sync states, and chain segments that are added while the ee is offline will follow the same logic without needing to synchronize a shared state among those. Another advantage of passive pause vs active pause is that we can still keep track of active advertised chain segments so that on resume we don't need to re-evaluate all our peers.
- **Backfill**: Not affected by ee states, we don't pause.

#### Resume Sync
- **Block lookups**: Enabled again.
- **Parent lookups**: Enabled again.
- **Range**: Active resume. Since the only real pause range does is not sending batches for processing, resume makes all chains that are holding ready-for-processing batches send them.
- **Backfill**: Not affected by ee states, no need to resume.

## Additional Info

**QUESTION**: Originally I made this to notify and change on synced state, but @pawanjay176 on talks with @paulhauner concluded we only need to check online/offline states. The upcheck function mentions extra checks to have a very up to date sync status to aid the networking stack. However, the only need the networking stack would have is this one. I added a TODO to review if the extra check can be removed.

Next gen of #3094
Will work best with #3439

Co-authored-by: Pawan Dhananjay <[email protected]>
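
The passive pause and active resume described for range sync can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's implementation: `SyncContext`, `Chain`, `Batch`, `processor_channel_if_enabled` and `send_ready_batches` are hypothetical names, and `std::sync::mpsc` replaces the real channel to the beacon processor so the example is self-contained.

```rust
use std::sync::mpsc::{channel, Sender};

// Hypothetical stand-in for a batch of blocks awaiting processing.
#[derive(Debug, PartialEq)]
struct Batch(u64);

// Hypothetical context that guards the channel to the beacon processor.
struct SyncContext {
    ee_online: bool,
    processor_tx: Sender<Batch>,
}

impl SyncContext {
    // Hands out the processor channel only while the EE is responsive.
    fn processor_channel_if_enabled(&self) -> Option<&Sender<Batch>> {
        self.ee_online.then_some(&self.processor_tx)
    }
}

// Hypothetical chain that holds batches until they can be processed.
struct Chain {
    ready: Vec<Batch>,
}

impl Chain {
    // Sends every held batch if the channel is available; returns the count.
    fn send_ready_batches(&mut self, cx: &SyncContext) -> usize {
        let tx = match cx.processor_channel_if_enabled() {
            Some(tx) => tx,
            None => return 0, // passive pause: keep holding the batches
        };
        let mut sent = 0;
        for batch in self.ready.drain(..) {
            if tx.send(batch).is_ok() {
                sent += 1;
            }
        }
        sent
    }
}

fn main() {
    let (processor_tx, processor_rx) = channel();
    let mut cx = SyncContext { ee_online: false, processor_tx };
    let mut chain = Chain { ready: vec![Batch(0), Batch(1)] };

    // EE offline: nothing reaches the processor, batches stay queued.
    let sent_while_offline = chain.send_ready_batches(&cx);
    // EE back online: resume asks the chain to flush what it holds.
    cx.ee_online = true;
    let sent_on_resume = chain.send_ready_batches(&cx);
    let received = processor_rx.try_iter().count();
    println!("{} {} {}", sent_while_offline, sent_on_resume, received);
}
```

Because the guard lives at the channel, a chain created while the ee is offline needs no extra state in this sketch: it simply fails to obtain the channel until the flag flips.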