Value of largeRequestsTimeout is set via --large-requests-timeout parameter when starting nimbus_beacon_node in trustedNodeSync mode. #6781

Open
wants to merge 1 commit into
base: unstable

@cmd0s cmd0s commented Dec 20, 2024

Problem

In the nimbus_beacon_node operating in trustedNodeSync mode, the largeRequestsTimeout variable is currently hardcoded to 120 seconds. Before August 12, 2024, this value was 90 seconds. Despite the increase to 120 seconds, it remains insufficient for slower internet connections, such as 16 Mbps download speeds. This results in the nimbus_beacon_node failing to complete trustedNodeSync with the server due to timeouts. Increasing this timeout resolves the issue and allows synchronization to succeed.
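As a back-of-the-envelope check of these numbers, the sketch below computes how much data fits through a link within a fixed whole-transfer timeout. The function names and the 300 MB payload are illustrative assumptions, not values taken from Nimbus; the calculation assumes a fully saturated link and ignores protocol overhead and latency.

```shell
#!/bin/sh
# Rough estimate only: assumes a fully saturated link, no protocol overhead.
# 1 Mbps = 1,000,000 bits/s = 125,000 bytes/s.

# max_bytes_within_timeout MBPS TIMEOUT_S
# Largest payload (bytes) that can be moved at MBPS within TIMEOUT_S seconds.
max_bytes_within_timeout() {
  echo $(( $1 * 125000 * $2 ))
}

# min_timeout_seconds SIZE_BYTES MBPS
# Smallest whole number of seconds needed to move SIZE_BYTES at MBPS.
min_timeout_seconds() {
  bytes_per_s=$(( $2 * 125000 ))
  # Round up so the result is never an underestimate.
  echo $(( ($1 + bytes_per_s - 1) / bytes_per_s ))
}

max_bytes_within_timeout 16 120    # prints 240000000 (about 240 MB)
max_bytes_within_timeout 16 90     # prints 180000000 (about 180 MB)
min_timeout_seconds 300000000 16   # hypothetical 300 MB payload; prints 150
```

Under these assumptions, any response larger than roughly 240 MB cannot complete within 120 seconds on a 16 Mbps link, however stable the connection is.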


Solution

Instead of proposing another fixed increase to the largeRequestsTimeout, this Pull Request introduces the ability to configure this value via a command-line parameter when starting nimbus_beacon_node.

  • By default, if the parameter is not provided, the value remains at 120 seconds, ensuring backward compatibility.
  • Optionally, users can specify the timeout using the --large-requests-timeout parameter (e.g., --large-requests-timeout=300) to set largeRequestsTimeout to 300 seconds.

This approach provides flexibility for various network conditions without requiring code changes or recompilation.


Additional Information

I lead the Web3Pi.io project, which enables users to easily and automatically set up a full Ethereum node on Raspberry Pi devices. This setup includes both execution and consensus clients, with Nimbus as the default consensus client.

Our users span many countries, sharing identical hardware and software configurations but differing in internet speeds. Users with slower connections have reported issues with trustedNodeSync, which I traced to the hardcoded timeout value causing synchronization failures. Increasing largeRequestsTimeout resolves the issue but requires editing the code and recompiling from source, which is impractical for many users.

The hardcoded timeout of 120 seconds is a limiting factor for some of my clients and prevents successful synchronization in suboptimal network conditions. Allowing this timeout to be configurable provides a scalable solution to this issue.


Request

I kindly request that this Pull Request be accepted to allow users to set largeRequestsTimeout via a parameter. Alternatively, I propose revisiting the hardcoded value and increasing it further. However, the configurable parameter approach is more flexible and better suited for diverse use cases.

Thank you for your consideration!

… parameter when starting nimbus_beacon_node in trustedNodeSync mode.
@tersec
Contributor

tersec commented Dec 23, 2024

One underlying issue is that this "timeout" is unusual: it covers the entire download, from beginning to end. A typical timeout does not; once data is being downloaded, it stops having any effect.
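To illustrate the distinction, here is a minimal sketch using curl rather than the Nimbus code itself. The two function names are hypothetical; the flags (--max-time, --connect-timeout, --speed-limit, --speed-time) are standard curl options. The first function behaves like a whole-transfer deadline, the second like a timeout that only fires when the connection cannot be made or the transfer stalls.

```shell
#!/bin/sh
# Illustration with curl only; this is not how nimbus_beacon_node is implemented.

# fetch_total_deadline URL OUT SECONDS
# Abort if the WHOLE transfer (connect + download) takes longer than SECONDS,
# no matter how steadily data is arriving.
fetch_total_deadline() {
  curl --fail --silent --show-error \
    --max-time "$3" \
    -o "$2" "$1"
}

# fetch_stall_deadline URL OUT SECONDS
# Abort only if connecting takes longer than SECONDS, or if throughput stays
# below 1 byte/s for SECONDS in a row. A slow but live download keeps going.
fetch_stall_deadline() {
  curl --fail --silent --show-error \
    --connect-timeout "$3" \
    --speed-limit 1 --speed-time "$3" \
    -o "$2" "$1"
}

# Example (needs a reachable server):
#   fetch_stall_deadline http://localhost:5052/eth/v2/debug/beacon/states/finalized state.finalized.ssz 120
```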

https://nimbus.guide/trusted-node-sync.html#sync-from-checkpoint-files does document a fallback:

# Obtain a state and a block from a Beacon API - these must be in SSZ format:
curl -o state.finalized.ssz \
  -H 'Accept: application/octet-stream' \
  http://localhost:5052/eth/v2/debug/beacon/states/finalized

# Start the beacon node using the downloaded state as starting point
./run-mainnet-beacon-node.sh \
  --finalized-checkpoint-state=state.finalized.ssz

The issue is primarily that it's more manual and less user-friendly. It does exist, though, and means that trustedNodeSync per se does not have to accommodate all possible network conditions.

In particular, there's a practical minimum download speed below which a node can't sustainably run on the network. 16 Mbps is debatably below that point by now, but it's still borderline, so, sure. But even if 16 Mbps might be fast enough, it's not clear that making TNS "work" on, say, a 5 Mbps connection does anyone much good compared with the more manual fallback. That fallback covers anyone who is truly in a situation where it's fine that their node can't keep up with libp2p gossip and who just needs TNS to work (I anticipate that people can come up with such scenarios).

A final point is that both using a command-line parameter and using curl and then a parameter (--finalized-checkpoint-state) are similarly non-default, and probably should be similarly rarely required.

So if you can provide some examples from nodes which can run Ethereum to some useful extent, but time out on TNS, of how long that curl command takes (i.e. how long TNS would be expected to take too, since it downloads the same SSZ from another node's beacon API), I'd be happy to adjust the defaults, and in the longer term, it might be worthwhile to adjust the timeout mechanism itself to stop counting once the TNS state download begins.

But I'm not convinced adding another configuration option on top of --finalized-checkpoint-state is the best approach here.
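One way to collect the measurements requested above is sketched below. It wraps the same curl download in a helper that reports the total time, size, and average speed through curl's --write-out variables, plus a small check against a timeout. The function names and the sample durations in the comments are hypothetical; the endpoint is the one from the documented fallback, and the debug API must be enabled on the serving node.

```shell
#!/bin/sh
# time_state_download BASE_URL OUT_FILE
# Download the finalized state as SSZ and print how long the transfer took.
time_state_download() {
  curl --fail --silent --show-error \
    -H 'Accept: application/octet-stream' \
    -o "$2" \
    -w 'seconds=%{time_total} bytes=%{size_download} bytes_per_s=%{speed_download}\n' \
    "$1/eth/v2/debug/beacon/states/finalized"
}

# exceeds_timeout SECONDS LIMIT
# Exit status 0 when a measured duration is longer than LIMIT seconds.
# awk handles the comparison because curl reports fractional seconds.
exceeds_timeout() {
  awk -v t="$1" -v limit="$2" 'BEGIN { exit !(t > limit) }'
}

# Example (needs a reachable Beacon API):
#   time_state_download http://localhost:5052 state.finalized.ssz
#   exceeds_timeout 150.2 120 && echo "would time out at 120 s"
```

Running this a few times on an affected node gives a direct figure to compare against the current 120-second default.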


Unit Test Results

       12 files  ±0    1 822 suites  ±0   59m 24s ⏱️ + 1m 4s
  5 327 tests ±0    4 980 ✔️ ±0  347 💤 ±0  0 ±0 
29 521 runs  ±0  29 077 ✔️ ±0  444 💤 ±0  0 ±0 

Results for commit 5be370d. ± Comparison against base commit 06cf78a.

@cmd0s
Author

cmd0s commented Dec 23, 2024

Thank you for your response. I agree with your observations, and it’s great to see that there’s a workaround using curl. I can script this approach on my end, and it should do the job.

I’m happy to run tests on the nodes where this issue occurred and share the measurements. This particular case involves a node that, once synchronized, operates correctly and maintains sync on mainnet, but the TNS process failed with a timeout.

It’s possible that a slight increase in the timeout value could resolve the issue.

As it’s the holiday season, it might take me a bit of time to run the tests on the client side, but I’ll make sure to revisit this topic and provide feedback once I have the results.

Wishing you a merry and peaceful holiday season!
