Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ropsten almost syncs with snap, then stops #22568

Closed
jakubgs opened this issue Mar 24, 2021 · 13 comments
Closed

Ropsten almost syncs with snap, then stops #22568

jakubgs opened this issue Mar 24, 2021 · 13 comments
Labels

Comments

@jakubgs
Copy link

jakubgs commented Mar 24, 2021

System information

Geth version: 1.10.1
OS & Version: Ubuntu 20.04.1

Expected behaviour

I expect Ropsten testnet to sync with snap sync mode.

Actual behaviour

The sync goes very well for a few hours:

image

But once it reaches block 0x971baa it just stops syncing:

image

The highestBlock keeps going up, while the currentBlock is stuck on 0x971baa:

 > rpc.sh eth_syncing | jq -c '{ current: .result.currentBlock, highest: .result.highestBlock }'
{"current":"0x971baa","highest":"0x971c04"}
 > rpc.sh eth_syncing | jq -c '{ current: .result.currentBlock, highest: .result.highestBlock }'
{"current":"0x971baa","highest":"0x971c05"}
 > rpc.sh eth_syncing | jq -c '{ current: .result.currentBlock, highest: .result.highestBlock }'
{"current":"0x971baa","highest":"0x971c06"}

Steps to reproduce the behaviour

Here are my flags:

--ropsten
--v5disc
--syncmode=snap
--light.serve=90
--metrics
--nousb
--verbosity=3
--maxpeers=50
--maxpendpeers=30
--port=30303
--nat=extip:1.2.3.4
--bootnodes="enode://6332792c4a00e3e4ee0926ed89e0d27ef985424d97b6a45bf0f23e51f0dcb5e66b875777506458aea7af6f9e4ffb69f43f3778ee73c81ed9d34c51c4b16b0b0f@52.232.243.152:30303,enode://94c15d1b9e2fe7ce56e458b9a3b672ef11894ddedd0c6f247e0f1d3487f52b66208fb4aeb8179fce6e3a749ea93ed147c37976d67af557508d199d9594c35f09@192.81.208.223:30303"
@jakubgs
Copy link
Author

jakubgs commented Mar 24, 2021

Actually, I just saw it roll back a bunch of syncing progress:

Looking for peers                        peercount=1 tried=6  static=0
Loaded most recent local header          number=9901059 hash="cb087e…5e04b4" td=33005760556867404 age=9h55m42s
Loaded most recent local full block      number=0       hash="419410…ca4a2d" td=1048576           age=51y11mo2w
Loaded most recent local fast block      number=9901059 hash="cb087e…5e04b4" td=33005760556867404 age=9h55m42s
Loaded last fast-sync pivot marker       number=9903017
Rolled back chain segment                header=9903109->9901059 fast=9903043->9901059 block=0->0 reason="syncing canceled (requested)"
Synchronisation failed, dropping peer    peer=89c07745507f7c00041ced1abbb55e2f4e34f7e5561a5e093353b06e33102aae err="no peers available or all tried for download"
Ethereum peer removal failed             peer=89c07745 err="peer not registered"
Looking for peers                        peercount=0 tried=6  static=0

@jakubgs
Copy link
Author

jakubgs commented Mar 24, 2021

What I also don't get is how is the currentBlock and highestBlock difference so small, while in the logs the sync progress appears as just 27.56%:

State sync in progress        synced=27.56% state=11.63GiB   [email protected]    [email protected]   [email protected]    eta=19h14m52.098s
State sync in progress        synced=27.56% state=11.64GiB   [email protected]    [email protected]   [email protected]    eta=19h15m26.794s

And as far as I can tell the size just keeps growing without making progress.

@jakubgs
Copy link
Author

jakubgs commented Mar 24, 2021

I might add this is my third attempt at syncing Ropsten with snap. The previous time my progress past certain point looked like this:

image

Which is... interesting.

@jakubgs
Copy link
Author

jakubgs commented Mar 24, 2021

Not sure if this is the same issue as described in #22555.
Also #22497 might be related or the same issue.

@jakubgs
Copy link
Author

jakubgs commented Mar 24, 2021

I will attempt to use --snapshot=false as suggested in #22497 (comment) and see what happens tomorrow.

@jakubgs
Copy link
Author

jakubgs commented Mar 25, 2021

Same thing, synced "nicely" for the first ~6 hours:

image

But then after it reached 80 blocks left to sync at 03:16:00 it stopped syncing:

image

This is the same result as I achieved in my previous attempt: #22568 (comment)

Seems like --snapshot=false did nothing.

@holiman
Copy link
Contributor

holiman commented Mar 25, 2021

Syncing Ethereum is a pain point for many people, so I'll try to detail what's happening behind the scenes so there might be a bit less confusion.

The current default mode of sync for Geth is called fast sync. Instead of starting from the genesis block and reprocessing all the transactions that ever occurred (which could take weeks), fast sync downloads the blocks, and only verifies the associated proof-of-works. Downloading all the blocks is a straightforward and fast procedure and will relatively quickly reassemble the entire chain.

Many people falsely assume that because they have the blocks, they are in sync. Unfortunately this is not the case, since no transaction was executed, so we do not have any account state available (ie. balances, nonces, smart contract code and data). These need to be downloaded separately and cross checked with the latest blocks. This phase is called the state trie download and it actually runs concurrently with the block downloads; alas it take a lot longer nowadays than downloading the blocks.

So, what's the state trie? In the Ethereum mainnet, there are a ton of accounts already, which track the balance, nonce, etc of each user/contract. The accounts themselves are however insufficient to run a node, they need to be cryptographically linked to each block so that nodes can actually verify that the account's are not tampered with. This cryptographic linking is done by creating a tree data structure above the accounts, each level aggregating the layer below it into an ever smaller layer, until you reach the single root. This gigantic data structure containing all the accounts and the intermediate cryptographic proofs is called the state trie.

Ok, so why does this pose a problem? This trie data structure is an intricate interlink of hundreds of millions of tiny cryptographic proofs (trie nodes). To truly have a synchronized node, you need to download all the account data, as well as all the tiny cryptographic proofs to verify that noone in the network is trying to cheat you. This itself is already a crazy number of data items. The part where it gets even messier is that this data is constantly morphing: at every block (15s), about 1000 nodes are deleted from this trie and about 2000 new ones are added. This means your node needs to synchronize a dataset that is changing 200 times per second. The worst part is that while you are synchronizing, the network is moving forward, and state that you begun to download might disappear while you're downloading, so your node needs to constantly follow the network while trying to gather all the recent data. But until you actually do gather all the data, your local node is not usable since it cannot cryptographically prove anything about any accounts.

If you see that you are 64 blocks behind mainnet, you aren't yet synchronized, not even close. You are just done with the block download phase and still running the state downloads. You can see this yourself via the seemingly endless Imported state entries [...] stream of logs. You'll need to wait that out too before your node comes truly online.


Q: The node just hangs on importing state enties?!

A: The node doesn't hang, it just doesn't know how large the state trie is in advance so it keeps on going and going and going until it discovers and downloads the entire thing.

The reason is that a block in Ethereum only contains the state root, a single hash of the root node. When the node begins synchronizing, it knows about exactly 1 node and tries to download it. That node, can refer up to 16 new nodes, so in the next step, we'll know about 16 new nodes and try to download those. As we go along the download, most of the nodes will reference new ones that we didn't know about until then. This is why you might be tempted to think it's stuck on the same numbers. It is not, rather it's discovering and downloading the trie as it goes along.

Q: I'm stuck at 64 blocks behind mainnet?!

A: As explained above, you are not stuck, just finished with the block download phase, waiting for the state download phase to complete too. This latter phase nowadays take a lot longer than just getting the blocks.

Q: Why does downloading the state take so long, I have good bandwidth?

A: State sync is mostly limited by disk IO, not bandwidth.

The state trie in Ethereum contains hundreds of millions of nodes, most of which take the form of a single hash referencing up to 16 other hashes. This is a horrible way to store data on a disk, because there's almost no structure in it, just random numbers referencing even more random numbers. This makes any underlying database weep, as it cannot optimize storing and looking up the data in any meaningful way.

Not only is storing the data very suboptimal, but due to the 200 modification / second and pruning of past data, we cannot even download it is a properly pre-processed way to make it import faster without the underlying database shuffling it around too much. The end result is that even a fast sync nowadays incurs a huge disk IO cost, which is too much for a mechanical hard drive.

Q: Wait, so I can't run a full node on an HDD?

A: Unfortunately not. Doing a fast sync on an HDD will take more time than you're willing to wait with the current data schema. Even if you do wait it out, an HDD will not be able to keep up with the read/write requirements of transaction processing on mainnet.

You however should be able to run a light client on an HDD with minimal impact on system resources. If you wish to run a full node however, an SSD is your only option.

@holiman
Copy link
Contributor

holiman commented Mar 25, 2021

The text above is a bottled response, but the effect you're seeing of "I'm close to the head, then it goes away" the same for snap sync. The blocks are downloaded, but the state (in the form of snapshot portions) are still being downloaded. The jagged edges ae from occasional pivot movements, when we refocus on the stateroot for a newer block, because the old block got too old.

@holiman
Copy link
Contributor

holiman commented Mar 25, 2021

State sync in progress        synced=27.56% state=11.63GiB   [email protected]    [email protected]   [email protected]    eta=19h14m52.098s
State sync in progress        synced=27.56% state=11.64GiB   [email protected]    [email protected]   [email protected]    eta=19h15m26.794s

You were doing progress there^ See the slot-count, it was increasing.

@jakubgs
Copy link
Author

jakubgs commented Mar 25, 2021

Thanks for the detailed explanation. Truth be told the writeup you did in #22568 (comment) should probably be put somewhere in https://github.com/ethereum/go-ethereum/wiki. It explains well how the sync process works. Or even better, somewhere in https://geth.ethereum.org/docs/.

Appreciate the explanation. I'll let it go for longer and see what happens.

@jakubgs
Copy link
Author

jakubgs commented Mar 25, 2021

I see there are no metrics for snap sync progress:

 > curl -s 'localhost:6060/debug/metrics/prometheus' | grep progress
 > curl -s 'localhost:6060/debug/metrics/prometheus' | grep sync    
# TYPE rpc_duration_eth_syncing_success_count counter
rpc_duration_eth_syncing_success_count 33
# TYPE rpc_duration_eth_syncing_success summary
rpc_duration_eth_syncing_success {quantile="0.5"} 53690
rpc_duration_eth_syncing_success {quantile="0.75"} 57922.5
rpc_duration_eth_syncing_success {quantile="0.95"} 97423.8999999999
rpc_duration_eth_syncing_success {quantile="0.99"} 119686
rpc_duration_eth_syncing_success {quantile="0.999"} 119686
rpc_duration_eth_syncing_success {quantile="0.9999"} 119686

Is there maybe an RPC call that can give me the current snap sync progress?

@jakubgs
Copy link
Author

jakubgs commented Mar 25, 2021

The only call I know about is eth_syncing:

 > rpc.sh eth_syncing
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "currentBlock": "0x971e94",
    "highestBlock": "0x971ee5",
    "knownStates": "0x0",
    "pulledStates": "0x0",
    "startingBlock": "0x9716ce"
  }
}

But the knownStates and pulledStates appears to show nothing.

@jakubgs
Copy link
Author

jakubgs commented Mar 25, 2021

Okay, looks like the full snap sync took almost 24 hours and take up about ~100 GB:

image

The initial block sync does most of the job in the first ~7 hours, after that the state trie sync took ~17 hours.

Though I'd really like to see the progress indicators in the eth_syncing RPC call and/or via the metrics endpoint.

@jakubgs jakubgs closed this as completed Mar 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants