
[Merged by Bors] - Pause sync when EE is offline #3428


@divagant-martian divagant-martian commented Aug 4, 2022

Issue Addressed

#3032

Proposed Changes

Pause sync when the EE is offline. Changes include three main parts:

  • Online/offline notification system
  • Pause sync
  • Resume sync

Online/offline notification system

  • The engine state is now guarded behind a new struct State that ensures every change is correctly notified. Notifications are only sent if the state actually changes. The new State sits behind an RwLock (as before) as the synchronization mechanism.
  • The actual notification channel is a tokio::sync::watch, which ensures only the most recent value sits in the receiver channel, so we don't need to worry about message ordering.
  • Sync waits for state changes concurrently with normal messages (see the sketch after this list).
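
Below is a minimal sketch of this mechanism, with simplified stand-ins rather than the exact Lighthouse types (`EngineState`, `State`, and `SyncMessage` here are illustrative; the real `State` additionally sits behind an `RwLock`):

```rust
use tokio::sync::{mpsc, watch};

/// Simplified stand-in for the engine's online/offline state.
#[derive(Clone, Copy, PartialEq, Debug)]
enum EngineState {
    Online,
    Offline,
}

/// Guards the engine state so that every change is notified exactly once.
struct State {
    state: EngineState,
    notifier: watch::Sender<EngineState>,
}

impl State {
    fn new(initial: EngineState) -> (Self, watch::Receiver<EngineState>) {
        let (notifier, receiver) = watch::channel(initial);
        (Self { state: initial, notifier }, receiver)
    }

    /// Update the state, notifying watchers only if it actually changed.
    fn set(&mut self, new_state: EngineState) {
        if self.state != new_state {
            self.state = new_state;
            // An error here only means all receivers were dropped.
            let _ = self.notifier.send(new_state);
        }
    }
}

/// Hypothetical message type for sync's normal work queue.
struct SyncMessage;

/// Sync waits for engine-state changes concurrently with normal messages.
async fn sync_loop(
    mut messages: mpsc::UnboundedReceiver<SyncMessage>,
    mut engine: watch::Receiver<EngineState>,
) {
    loop {
        tokio::select! {
            Some(_msg) = messages.recv() => {
                // Handle a normal sync message.
            }
            Ok(()) = engine.changed() => {
                // `borrow` yields the latest value only; any intermediate
                // flips we never observed are irrelevant.
                match *engine.borrow() {
                    EngineState::Online => { /* resume sync */ }
                    EngineState::Offline => { /* pause sync */ }
                }
            }
        }
    }
}
```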

Pause Sync

Sync has four components, and pausing is handled differently in each:

  • Block lookups: Disabled while the EE is offline. We drop current requests and don't search for new blocks. Block lookups are infrequent, and I don't think it's worth the extra logic of keeping them around and delaying processing; if we later see that this is required, we can add it.
  • Parent lookups: Disabled while the EE is offline. We drop current requests and don't search for new parents. Parent lookups are even less frequent, and the same reasoning applies: if we later see that keeping and delaying them is required, we can add it.
  • Range: Chains don't send batches for processing to the beacon processor. This is easily done by guarding the channel to the beacon processor and giving chains access to it only while the EE is responsive (see the sketch after this list). I find this the simplest and most powerful approach: we don't need to deal with new sync states, and chain segments added while the EE is offline follow the same logic without needing to synchronize a shared state among them. Another advantage of passive pause over active pause is that we can still keep track of active advertised chain segments, so on resume we don't need to re-evaluate all our peers.
  • Backfill: Not affected by EE states; we don't pause.
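
As a sketch of the passive pause described for Range above (all names here are hypothetical, not the actual Lighthouse identifiers), the sender to the beacon processor can be hidden behind an accessor that only yields it while the EE is believed to be online:

```rust
use tokio::sync::mpsc;

/// Placeholder for a batch of blocks handed to the beacon processor.
struct WorkEvent;

/// Hypothetical slice of sync's network context.
struct SyncNetworkContext {
    beacon_processor_send: mpsc::Sender<WorkEvent>,
    execution_engine_online: bool,
}

impl SyncNetworkContext {
    /// Chains must go through this accessor to reach the beacon processor.
    /// While the EE is offline it returns `None`, so batches are simply
    /// held back without any extra sync state.
    fn processor_channel_if_enabled(&self) -> Option<&mpsc::Sender<WorkEvent>> {
        self.execution_engine_online
            .then_some(&self.beacon_processor_send)
    }
}
```

Chain segments added while the EE is offline go through the same accessor, which is why they need no special handling.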

Resume Sync

  • Block lookups: Enabled again.
  • Parent lookups: Enabled again.
  • Range: Active resume. Since the only thing pausing really does for range sync is stop batches being sent for processing, resume makes all chains that are holding ready-for-processing batches send them (sketched below).
  • Backfill: Not affected by EE states; no need to resume.
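
Continuing the sketch above (again with hypothetical names), active resume just re-enables the channel and asks every chain to flush the batches it was holding:

```rust
/// Placeholder for a chain holding ready-for-processing batches.
struct SyncingChain {
    ready_batches: Vec<WorkEvent>,
}

impl SyncingChain {
    /// On resume, push held batches through the now-accessible channel.
    fn resume(&mut self, network: &SyncNetworkContext) {
        if let Some(sender) = network.processor_channel_if_enabled() {
            for batch in self.ready_batches.drain(..) {
                // `try_send` keeps the sketch synchronous; real code would
                // handle a full queue rather than dropping the batch.
                let _ = sender.try_send(batch);
            }
        }
    }
}

/// Mark the EE as online again and resume every chain.
fn resume_range_sync(network: &mut SyncNetworkContext, chains: &mut [SyncingChain]) {
    network.execution_engine_online = true;
    for chain in chains {
        chain.resume(network);
    }
}
```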

Additional Info

QUESTION: Originally I made this notify on changes to the synced state as well, but @pawanjay176, in talks with @paulhauner, concluded that we only need to track online/offline states. The upcheck function mentions extra checks to keep a very up-to-date sync status to aid the networking stack; however, this is the only need the networking stack actually has. I added a TODO to review whether the extra check can be removed.

Next gen of #3094

Will work best with #3439

@divagant-martian divagant-martian changed the title add notifier to ee Pause sync when EE is offline Aug 8, 2022
@divagant-martian divagant-martian marked this pull request as ready for review August 10, 2022 19:32
@paulhauner paulhauner added the ready-for-review The code is ready for review label Aug 10, 2022
@pawanjay176 pawanjay176 added the under-review A reviewer has only partially completed a review. label Aug 12, 2022
@pawanjay176 pawanjay176 (Member) left a comment


Looking good. I think returning a bool in the watch channel could be confusing; I added my suggestion here: pawanjay176@070c4b9
Let me know what you think :) Just some other minor nits.

Resolved review threads (outdated):
  • beacon_node/network/src/sync/network_context.rs (2)
  • beacon_node/network/src/sync/manager.rs (2)
  • beacon_node/execution_layer/src/engines.rs (3)
impl Default for State {
    fn default() -> Self {
        let state = EngineState::default();
        let (notifier, _receiver) = watch::channel(state.is_online());
A Member commented on this code:

If the default state is Offline, wouldn't that lead to the sync manager reading the initial state as Offline and pausing all sync, given that the default state in sync is Online? Maybe we can start as Online and change to Offline if the initial upcheck fails? I might be misunderstanding tokio::watch here, so please correct me if I'm wrong.

@divagant-martian (Collaborator, Author) replied:

What you said is right; however, this is what I expect. Since the default state in sync is Online, we start normally. If there isn't an execution engine, that state will never change. If there is one, then the right behaviour is to pause at the beginning. It seems unintuitive, but I think it's right.
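
For reference, the tokio::sync::watch behaviour under discussion can be verified in a few lines; this uses the real tokio API, with a `bool` payload mirroring the PR's original channel:

```rust
#[tokio::main]
async fn main() {
    // A watch channel always holds exactly one value; `false` = offline here.
    let (tx, mut rx) = tokio::sync::watch::channel(false);
    assert!(!*rx.borrow());

    // The first successful upcheck flips the value; receivers wake on change
    // and always read the latest value.
    tx.send(true).unwrap();
    rx.changed().await.unwrap();
    assert!(*rx.borrow());
}
```

So a receiver created while the state is Offline reports Offline until the first successful upcheck sends Online, which is exactly the pause-then-resume behaviour described above.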

@pawanjay176 pawanjay176 added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed ready-for-review The code is ready for review labels Aug 13, 2022
@divagant-martian divagant-martian added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits their implementation. labels Aug 13, 2022
@divagant-martian (Collaborator, Author)

Thanks for the review @pawanjay176 :) All comments except two are addressed; I'm leaving those open to wait for more feedback. Also, do you think you could help me test this?

@pawanjay176 (Member)

I'll be testing this locally over the weekend :)

Also, @paulhauner mentioned that we will be provisioning a new BN <-> EE pair in our infra, without any connected validators, sometime next week. We could test this on those boxes by scheduling a cron job that stops the EE service and restarts it after a couple of hours.

@pawanjay176 pawanjay176 (Member) left a comment


Awesome work! I have tested this on Sepolia by shutting down my execution node (Geth) for multiple intervals (1 minute, 1 epoch, 1 hour, and a couple of hours), with and without backfill sync, and it stops and recovers as expected every time 🎉

@paulhauner paulhauner (Member) left a comment


Looks great! I'm not an expert in the sync code; however, the execution_layer and other components I'm familiar with look good. It also appears to me that the changes are nicely isolated to the sync functions and not leaking "upstream" into the BeaconProcessor or similar ☺️

@paulhauner (Member)

> Also, @paulhauner mentioned that we will be provisioning a new BN <-> EE pair in our infra, without any connected validators, sometime next week. We could test this on those boxes by scheduling a cron job that stops the EE service and restarts it after a couple of hours.

I haven't had a chance to do this, I'm sorry. I've been waiting on a few servers that were only just provisioned by our cloud provider over the weekend and now we're in merge mode. If anyone wouldn't feel comfortable merging without those tests then I can certainly make it happen!

@paulhauner paulhauner added the bellatrix Required to support the Bellatrix Upgrade label Aug 22, 2022

pawanjay176 commented Aug 22, 2022

I think I'm comfortable merging this. Have done sufficient local testing.

@divagant-martian divagant-martian added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Aug 24, 2022
@michaelsproul (Member)

bors r+

bors bot pushed a commit that referenced this pull request Aug 24, 2022
Co-authored-by: Pawan Dhananjay <[email protected]>
@bors bors bot changed the title Pause sync when EE is offline [Merged by Bors] - Pause sync when EE is offline Aug 25, 2022
@bors bors bot closed this Aug 25, 2022
Woodpile37 pushed a commit to Woodpile37/lighthouse that referenced this pull request Jan 6, 2024
Labels
  • bellatrix (Required to support the Bellatrix Upgrade)
  • Networking
  • ready-for-merge (This PR is ready to merge.)
  • v3.1.0