Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve beacon node failover in validator client [tracking issue] #3613

Closed
michaelsproul opened this issue Sep 27, 2022 · 3 comments
Closed
Assignees
Labels
enhancement New feature or request RFC Request for comment val-client Relates to the validator client binary

Comments

@michaelsproul
Copy link
Member

Description

I think the beacon node failover feature is in need of some love, particularly now that we are post-merge and it is complicated by the addition of the execution node.

There are several issues:

  • Beacon nodes continue to self-report their status as synced when the execution node goes offline. This means they will be used by validator clients to produce sub-optimal attestations.
  • Beacon nodes self-report their status as synced for eight epochs after their internal sync state switches to syncing. Again this results in sub-optimal attestations.
  • Beacon nodes return nasty 500 internal server errors when the head is optimistic, e.g.
CRIT Error during attestation routine        slot: 4747474, committee_index: 11, error: "All endpoints failed http://localhost:5052/ => RequestFailed("Failed to produce attestation data: ServerMessage(ErrorMessage { code: 500, message: \"UNHANDLED_ERROR: HeadBlockNotFullyVerified { beacon_block_root: 0xbef1f36d129ca41de0a2da31962f5cb5025262f8813c6be2448524fc75be9947, execution_status: Optimistic(0x93219f971e74377cb48ba8050f1f193390d0313afd23c8cc4b1e8a119fb32fa1) }\", stacktraces: [] })")"

This error is probably handled most gracefully by a VC with redundant beacon nodes, as it should failover to the next BN (need to double check this, the log above is from a node without failover BNs).

Option 1: Fail Fast

One way to address all of these issues would be to add a flag to the beacon node like --fail-fast which makes it report its status as unsynced more readily:

The downside of this approach is that it adds configuration complexity: most beacon nodes used for redundancy should be configured with --fail-fast, but at least one beacon node per cluster should not be, in order to not cripple liveness in case of sync issues (which is what SYNC_TOLERANCE_EPOCHS is trying to guard against).

The --fail-fast solution may also still not fail fast enough for some users, Lighthouse will usually consider itself synced for a few epochs (2?) without any blocks, during which time attestations will still miss.

Option 2: Quality Control

Rather than changing the behaviour of the beacon node, we could change the behaviour of the validator client to make it smarter about which beacon nodes to try first. Instead of using a binary "synced" or "not synced" differentiation the VC could use other data to make its choice, e.g. sync distance, optimistic status, execution layer online/offline status.

The complication with this approach is that we need to work within the confines of standard APIs provided by beacon nodes (which e.g. doesn't include EL status), and we need to avoid opening ourselves to attacks where an attacker tricks us into following their chain by proposing blocks closer to the current wall clock slot than the canonical chain.

I think I have a preference for option 2 currently.

@michaelsproul michaelsproul added RFC Request for comment enhancement New feature or request val-client Relates to the validator client binary labels Sep 27, 2022
@michaelsproul
Copy link
Member Author

Related:

Complementary:

  • VC: Publish to multiple BNs #3516. Publishing messages to all connected BNs is an orthogonal improvement that we could make alongside improving the current lightweight fallback system.

@jmcruz1983
Copy link

@michaelsproul
what about implementing following logic on the /eth/v1/node/health endpoint:

  • when execution node is syncing then returning unhealthy (something like 4xx or 5xx http error code)
  • using that as readiness probe in k8s would stop sending traffic to current beacon meanwhile execution node is syncing
    wdyt?

bors bot pushed a commit that referenced this issue May 17, 2023
## Issue Addressed

Closes #4291, part of #3613.

## Proposed Changes

- Implement the `el_offline` field on `/eth/v1/node/syncing`. We set `el_offline=true` if:
  - The EL's internal status is `Offline` or `AuthFailed`, _or_
  - The most recent call to `newPayload` resulted in an error (more on this in a moment).

- Use the `el_offline` field in the VC to mark nodes with offline ELs as _unsynced_. These nodes will still be used, but only after synced nodes.
- Overhaul the usage of `RequireSynced` so that `::No` is used almost everywhere. The `--allow-unsynced` flag was broken and had the opposite effect to intended, so it has been deprecated.
- Add tests for the EL being offline on the upcheck call, and being offline due to the newPayload check.


## Why track `newPayload` errors?

Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because:

- If the EL is timing out to some calls, it's unlikely to timeout on the `upcheck` call, which is _just_ `eth_syncing`. Every failed call is followed by an upcheck [here](https://github.com/sigp/lighthouse/blob/693886b94176faa4cb450f024696cb69cda2fe58/beacon_node/execution_layer/src/engines.rs#L372-L380), which would have the effect of masking the failure and keeping the status _online_.
- The `newPayload` call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with `forkchoiceUpdated` usually returning much faster (<50ms).
- If `newPayload` is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case validator clients _should_ prefer other BNs if they have one available. In the second case, all of their BNs will likely report `el_offline` and they'll just have to proceed with trying to use them.

## Additional Changes

- Add utility method `ForkName::latest` which is quite convenient for test writing, but probably other things too.
- Delete some stale comments from when we used to support multiple execution nodes.
ghost pushed a commit to oone-world/lighthouse that referenced this issue Jul 13, 2023
## Issue Addressed

Closes sigp#4291, part of sigp#3613.

## Proposed Changes

- Implement the `el_offline` field on `/eth/v1/node/syncing`. We set `el_offline=true` if:
  - The EL's internal status is `Offline` or `AuthFailed`, _or_
  - The most recent call to `newPayload` resulted in an error (more on this in a moment).

- Use the `el_offline` field in the VC to mark nodes with offline ELs as _unsynced_. These nodes will still be used, but only after synced nodes.
- Overhaul the usage of `RequireSynced` so that `::No` is used almost everywhere. The `--allow-unsynced` flag was broken and had the opposite effect to intended, so it has been deprecated.
- Add tests for the EL being offline on the upcheck call, and being offline due to the newPayload check.


## Why track `newPayload` errors?

Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because:

- If the EL is timing out to some calls, it's unlikely to timeout on the `upcheck` call, which is _just_ `eth_syncing`. Every failed call is followed by an upcheck [here](https://github.com/sigp/lighthouse/blob/693886b94176faa4cb450f024696cb69cda2fe58/beacon_node/execution_layer/src/engines.rs#L372-L380), which would have the effect of masking the failure and keeping the status _online_.
- The `newPayload` call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with `forkchoiceUpdated` usually returning much faster (<50ms).
- If `newPayload` is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case validator clients _should_ prefer other BNs if they have one available. In the second case, all of their BNs will likely report `el_offline` and they'll just have to proceed with trying to use them.

## Additional Changes

- Add utility method `ForkName::latest` which is quite convenient for test writing, but probably other things too.
- Delete some stale comments from when we used to support multiple execution nodes.
Woodpile37 pushed a commit to Woodpile37/lighthouse that referenced this issue Jan 6, 2024
Closes sigp#4291, part of sigp#3613.

- Implement the `el_offline` field on `/eth/v1/node/syncing`. We set `el_offline=true` if:
  - The EL's internal status is `Offline` or `AuthFailed`, _or_
  - The most recent call to `newPayload` resulted in an error (more on this in a moment).

- Use the `el_offline` field in the VC to mark nodes with offline ELs as _unsynced_. These nodes will still be used, but only after synced nodes.
- Overhaul the usage of `RequireSynced` so that `::No` is used almost everywhere. The `--allow-unsynced` flag was broken and had the opposite effect to intended, so it has been deprecated.
- Add tests for the EL being offline on the upcheck call, and being offline due to the newPayload check.

Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because:

- If the EL is timing out to some calls, it's unlikely to timeout on the `upcheck` call, which is _just_ `eth_syncing`. Every failed call is followed by an upcheck [here](https://github.com/sigp/lighthouse/blob/693886b94176faa4cb450f024696cb69cda2fe58/beacon_node/execution_layer/src/engines.rs#L372-L380), which would have the effect of masking the failure and keeping the status _online_.
- The `newPayload` call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with `forkchoiceUpdated` usually returning much faster (<50ms).
- If `newPayload` is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case validator clients _should_ prefer other BNs if they have one available. In the second case, all of their BNs will likely report `el_offline` and they'll just have to proceed with trying to use them.

- Add utility method `ForkName::latest` which is quite convenient for test writing, but probably other things too.
- Delete some stale comments from when we used to support multiple execution nodes.
Woodpile37 pushed a commit to Woodpile37/lighthouse that referenced this issue Jan 6, 2024
Closes sigp#4291, part of sigp#3613.

- Implement the `el_offline` field on `/eth/v1/node/syncing`. We set `el_offline=true` if:
  - The EL's internal status is `Offline` or `AuthFailed`, _or_
  - The most recent call to `newPayload` resulted in an error (more on this in a moment).

- Use the `el_offline` field in the VC to mark nodes with offline ELs as _unsynced_. These nodes will still be used, but only after synced nodes.
- Overhaul the usage of `RequireSynced` so that `::No` is used almost everywhere. The `--allow-unsynced` flag was broken and had the opposite effect to intended, so it has been deprecated.
- Add tests for the EL being offline on the upcheck call, and being offline due to the newPayload check.

Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because:

- If the EL is timing out to some calls, it's unlikely to timeout on the `upcheck` call, which is _just_ `eth_syncing`. Every failed call is followed by an upcheck [here](https://github.com/sigp/lighthouse/blob/693886b94176faa4cb450f024696cb69cda2fe58/beacon_node/execution_layer/src/engines.rs#L372-L380), which would have the effect of masking the failure and keeping the status _online_.
- The `newPayload` call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with `forkchoiceUpdated` usually returning much faster (<50ms).
- If `newPayload` is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case validator clients _should_ prefer other BNs if they have one available. In the second case, all of their BNs will likely report `el_offline` and they'll just have to proceed with trying to use them.

- Add utility method `ForkName::latest` which is quite convenient for test writing, but probably other things too.
- Delete some stale comments from when we used to support multiple execution nodes.
@macladson
Copy link
Member

Since the merge of #4393, we now have more sophisticated fallback behaviour following Option 2 listed in the issue.
I'll close this issue now, and if fallback issues continue to be widespread, new issues should be opened to address them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request RFC Request for comment val-client Relates to the validator client binary
Projects
None yet
Development

No branches or pull requests

3 participants