geth --fast stalls before crossing finish line #15001
Comments
I have been having a similar issue recently. Ubuntu 16.04. Stalling on the last ~100-200 blocks. Restarting the geth client has allowed for some of those missing blocks to be processed but it does not keep up with the highest block. The only fluctuation I see in eth.syncing is the number of knownStates and pulledStates. |
I am having the exact same issue as Laughing Cabbage has described, also on Ubuntu 16.04, and also stuck on the last few hundred blocks. |
If any current devs think they might have a lead as to where a good starting point might be for tracking this issue I'm happy to do some bug hunting, please let me know. |
@Dirksterson @laughingcabbage I have had exactly the same issues for the past week, and so have many of my colleagues. After latest advice to run
|
Same here. I am also on v1.6.7. Current status, after running it for more than a week:
|
A similar issue here. On Aug 16th I had an almost fully synced blockchain, just 10-20 hours behind the current block. I then started geth as:
Geth has been behind the current block the whole time. Currently (Aug 21st) its state is:
whereas etherscan.io shows 4185672 as the last block. There are no errors in geth's output; it is in its normal state of slowly importing new segments and using the HDD at 5-10 MB/s (both reading and writing). No high CPU usage.
My geth is:
|
Same issue here, started around the same time. Looks like this is happening to everyone and is now affecting Parity users too. |
Hello, Ubuntu 16.04 here and same issue: got stuck on the last ~2000 blocks. |
If you don't have an SSD, you are never going to get them.
If you do have an SSD, just keep restarting the Docker container and your client and pray. Eventually, after several days, it gets there. You must be persistent: it will crash randomly, and then when you reopen it, it will be syncing.
It really took me 2 weeks to do this.
|
Same problem. Can't sync last ~100 blocks on 1.6.7. Restarting gets close but lots of Stalling state sync, dropping peer messages |
Try using an SSD drive and the Docker image. This is working for me; expect at least 4 hours to sync up. Sundays, when volume is low, are a good time to try.
… On 1 Sep 2017, at 20:01, Hasham Ahmad ***@***.***> wrote:
Same problem. Can't sync last ~100 blocks on 1.6.7. Restarting gets close but lots of Stalling state sync, dropping peer messages
|
@wtfiwtz I don't really know enough about the whole process, but I would say yeah probably... for what it's worth... |
I was able to get it to successfully sync up, after switching from Not sure if the |
Ok I had to restart the sync from the beginning and have hit this problem again... This is what I have found... Blocks are getting discarded from peers because the
The height is retrieved from a callback function such as this:
So this is probably an issue with switching between fast and normal sync modes, where the chain height is assumed to be 0 when it should be equal to the fast chain height on initialization. Is this an area you are familiar with @karalabe since you did the original fast sync implementation? |
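To make the hypothesis above concrete, here is a minimal sketch of the kind of distance check being described. It assumes the distance is simply the announced block number minus the local chain height, and that blocks outside a fixed window are discarded. The constants and function names are assumptions for illustration (the window sizes are what I believe geth's fetcher uses, but I have not confirmed them), and the block numbers are taken from the log output posted later in this thread.

```python
# Assumed window sizes; illustrative, not confirmed against the geth source.
MAX_UNCLE_DIST = 7    # how far behind our head a block may be
MAX_QUEUE_DIST = 32   # how far ahead of our head a block may be


def block_distance(block_number: int, chain_height: int) -> int:
    """Distance of an announced block from our local chain height."""
    return block_number - chain_height


def is_discarded(block_number: int, chain_height: int) -> bool:
    """True if the block falls outside the accepted window around our head."""
    dist = block_distance(block_number, chain_height)
    return dist < -MAX_UNCLE_DIST or dist > MAX_QUEUE_DIST


# With the height wrongly reported as 0, the distance equals the block number.
print(block_distance(4305854, 0), is_discarded(4305854, 0))
# With the real head (4305451) the distance is small but still outside the window.
print(block_distance(4305853, 4305451), is_discarded(4305853, 4305451))
```

If the height callback returns 0 during fast sync, every propagated block looks millions of blocks away and is dropped, which matches the "too far away" warnings.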
If the peer's total difficulty is much lower, does that mean they are only on the full sync mode and won't work with a fast sync peer?
Pretty much can't find any peers that are not with a significantly lower total difficulty! This worries me because of the following comment:
|
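Ok I left it running this morning, and at some point, it flipped from fast to full mode when it received just 1 more chain segment:
The key log messages here are Committed new head block and Imported new chain segment, which allow the full head blockchain count to update. So I'm guessing that the network is starved of fast blocks, and they haven't yet reached their intended pivot point... before they flip to full mode. Also note that you can't force it to use full mode on the command line, it doesn't work. Is there some way to force this flipping from fast to full mode prematurely? Perhaps if we haven't received a new chain segment for over an hour? Or find a peer that has what we are looking for with a broader peer search? |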
I got this peer syncing problem constantly because I was not on a USA time server but using an Asian one.
After changing my NTP time server settings I would get 20 peers connecting; previously it was 1 to 3. The peers connect, but I still get the same errors you show in the log.
Currently only a few of our machines' wallets will finish syncing. The latest Macs (newer than 2015) find it easiest; my 2011 Mac is slowest. All have SSDs, and all are on 100 Mb fibre connections.
Thanks for the support.
… On 24 Sep 2017, at 09:20, Nigel Sheridan-Smith ***@***.***> wrote:
Ok I left it running this morning, and at some point, it flipped from fast to full mode when it received just 1 more chain segment:
INFO [09-24|09:07:27] Peer discarded announcement peer=b057fec043b525ed number=4305853 hash=37f333…4a74d9 distance=4305853
INFO [09-24|09:07:27] ** Total difficulty ours="{neg:false abs:[14195334218426315772 54]}" theirs=17179869184
INFO [09-24|09:07:27] ** fast sync? peer=b057fec043b525ed enabled=1
INFO [09-24|09:07:27] ** Block number num=4305854
INFO [09-24|09:07:27] ** Chain height num=0
WARN [09-24|09:07:27] Discarded propagated block, too far away peer=b057fec043b525ed number=4305854 hash=2e8a61…ae021f distance=4305854
INFO [09-24|09:07:27] Imported new state entries count=448 elapsed=1.479ms processed=2608239 pending=2047 retry=2 duplicate=2846 unexpected=8434
INFO [09-24|09:07:29] Imported new state entries count=779 elapsed=3.995ms processed=2609018 pending=2225 retry=22 duplicate=2846 unexpected=8434
INFO [09-24|09:07:29] ** Total difficulty ours="{neg:false abs:[14195334218426315772 54]}" theirs=17179869184
INFO [09-24|09:07:29] ** fast sync? peer=479032d8362da82d enabled=1
INFO [09-24|09:07:31] Imported new state entries count=1089 elapsed=10.173ms processed=2610107 pending=1483 retry=1 duplicate=2846 unexpected=8434
INFO [09-24|09:07:35] Imported new state entries count=1081 elapsed=14.713ms processed=2611188 pending=48 retry=0 duplicate=2846 unexpected=8434
INFO [09-24|09:07:35] Imported new state entries count=35 elapsed=853.5µs processed=2611223 pending=0 retry=0 duplicate=2846 unexpected=8434
INFO [09-24|09:07:35] Imported new block receipts count=0 elapsed=3.752ms bytes=0 number=4305451 hash=ac92d6…397f6c ignored=1
INFO [09-24|09:07:35] Committed new head block number=4305451 hash=ac92d6…397f6c
INFO [09-24|09:07:35] Imported new chain segment blocks=1 txs=17 mgas=0.442 elapsed=28.174ms mgasps=15.701 number=4305452 hash=4a61da…5f72e4
ERROR[09-24|09:07:35]
########## BAD BLOCK #########
Chain config: {ChainID: 1 Homestead: 1150000 DAO: 1920000 DAOSupport: true EIP150: 2463000 EIP155: 2675000 EIP158: 2675000 Byzantium: 9223372036854775807 Engine: ethash}
Number: 4305453
Hash: 0x6c4471bed33ac85f132153650f4f69230e9ef972ff33cba1e79795fb72130c66
Error: unknown ancestor
##############################
WARN [09-24|09:07:35] Synchronisation failed, dropping peer peer=cb8ebbf8130355a7 err="retrieved hash chain is invalid"
ERROR[09-24|09:07:35] Fast sync complete, auto disabling
INFO [09-24|09:07:35] Removing p2p peer id=cb8ebbf8130355a7 conn=inbound duration=1h32m36.442s peers=24 req=false err="useless peer"
INFO [09-24|09:07:36] Ethereum peer connected id=8453dbef52518caf conn=dyndial name=Geth/v1.6.7-stable-ab5646c5/linux-amd64/go1.8.1
INFO [09-24|09:07:36] ** Total difficulty ours="{neg:false abs:[14195334218426315772 54]}" theirs=1009137134152556054860
INFO [09-24|09:07:36] ** fast sync? peer=479032d8362da82d enabled=0
WARN [09-24|09:07:36] Ethereum handshake failed id=8453dbef52518caf conn=dyndial err="Genesis block mismatch - 6577484f58748da6 (!= d4e56740f876aef8)"
INFO [09-24|09:07:36] Removing p2p peer id=8453dbef52518caf conn=dyndial duration=279.836ms peers=24 req=false err="Genesis block mismatch - 6577484f58748da6 (!= d4e56740f876aef8)"
INFO [09-24|09:07:37] Peer discarded announcement peer=ca40c7662d6ac5ed number=4305853 hash=37f333…4a74d9 distance=402
INFO [09-24|09:07:37] Peer discarded announcement peer=ca40c7662d6ac5ed number=4305854 hash=2e8a61…ae021f distance=403
INFO [09-24|09:07:38] Ethereum peer connected id=6949cab8fc6d09bd conn=inbound name=Geth/v1.6.2-unstable-2a41e76b/linux-amd64/go1.8.3
The key log messages here are Committed new head block and Imported new chain segment, which allows the full head blockchain count to update.
So I'm guessing that the network is starved of fast blocks, and they haven't yet reached their intended pivot point... before they flip to full mode.
Also note that you can't force it to use full mode on the command line, it doesn't work.
Is there some way to force this flipping from fast to full mode prematurely? Perhaps if we haven't received a new chain segment for over an hour? Or find a peer that has what we are looking for with a more broader peer search?
|
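You'll probably find it much easier to be on Parity (https://parity.io) - the wallet can do a light-mode sync in around 20-30 minutes... this is a good short-to-medium term solution. However, on Mac you need to be on OS X Sierra (or use the brew install instead). I think someone needs to re-architect the fast sync in geth as the client needs to reach out to more diverse peers when it gets "stuck" for long periods of time like this. I have a few ideas, but very limited time, and it really needs to be done (or reviewed) by someone who knows what they are doing :P |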
Thanks, yeah, we run Parity, but it won't sync either; it's just the same problem. So what we do is run the Docker image, and every time it fucks up we 'turn it off and on again'. Say a little prayer, and one time out of ten it will work. This is the only way we have found to sync in order to run our business. We have one employee whose job is to come in and run the sync every day at 8am before we start; then we can TeamViewer into that machine. Ethereum is a labour of love right now. I don't know how that impression affects the newcomers out there. Being so un-user-friendly probably isn't helping adoption.
… On 25 Sep 2017, at 07:34, Nigel Sheridan-Smith ***@***.***> wrote:
You'll probably find it much easier to be on Parity (https://parity.io) - the wallet can do a light-mode sync in around 20-30 minutes... this is a good short-to-medium term solution. However, on Mac you need to be on OS X Sierra (or use the brew install instead)
I think someone needs to re-architect the fast sync in geth as the client needs to reach out to more diverse peers when it gets "stuck" for long periods of time like this. I have a few ideas, but very limited time, and it really needs to be done (or reviewed) by someone who knows what they are doing :P
|
@Mergathal that is the nature of a blockchain-based approach. Since Bitcoin only targets a block every 10 minutes, its throughput is lower and the number of blocks is lower. Ethereum generates a new block roughly every 15 seconds, allowing more transactions and faster response times. There will naturally be more data generated due to this approach. The data would need to be pruned somehow to keep it at a reasonable level. Interestingly, in http://www.freekpaans.nl/2018/04/anatomy-geth-fast-sync/, a completed fast sync only took 77 GB of locally stored blockchain data. I've routinely destroyed fast syncs with much more data than that (... I have limited space on my MacBook Pro). It seems to me that the longer you are pulling down the state tries, the more data is stored locally. It may also depend on how long you are "full syncing" for once the fast sync is complete. I'm yet to fully understand why, but it's an interesting observation. |
We constantly 'refresh' by fast syncing from scratch to keep the size in check. An initial fast sync is only around 60 GB (as of maybe a month ago); then the size grows. After one month we are seeing 140 GB. Not sure if it is because older state needs to be pulled in or what. Does anyone with a 'true' full sync know the current disk size? |
@garyng2000 a full sync took 220 GB according to the article linked above. So it would be approximately 80 GB a month as a "fast sync" switches to a "full sync". |
@wtfiwtz |
@garyng2000 it could be because the accumulated state is bigger when you participate in the immediate verification of transactions, whereas post-verification there is not as much information to download from peers. However, you would need someone more knowledgeable about Ethereum's inner workings to confirm or deny that. |
I'm on geth v1.8.4 and Ubuntu 16.04. Not only is geth stopping before final sync, but it completely stalls around 30-60 minutes after starting a sync. The CPU usage drops to ~3% of capacity and stays there. I see continuous error messages for connecting to nodes, and the state and blocks completely stop updating. I have to restart geth (I use systemd restart). This is very concerning because I don't want my node to stall in the middle of serving our dapp. |
@GeeeCoin you might want to try v1.8.3 - I had a similar issue to yours when I moved from .3 to .4 |
@suspended v1.8.6 has the same unresolved issue. Downgrading to geth v1.8.3 worked for about 3 weeks, but I am now facing the same issues. |
I am also having the same sync problems... dropping peers etc. I am almost synced (about 50-100 blocks behind if I let it run). If I restart geth it catches up until peers start to drop again. Using Ubuntu 16.04. I have tried different versions of Geth down to 1.8.2. Built the dev version too with no change. I have lots of experience running a node, having done it since the start... but I did re-download the blockchain a month or 2 ago. I use a SATA 500GB SSD, but it is encrypted at the drive level, as is the home directory where the blockchain is stored. The encryption means that reads and writes are slower, and a disk monitor shows a constantly high level of activity while geth is running. I understand that storing the blockchain on an encrypted drive is probably not the best setup (for speed and for the number of writes/life of the SSD), so I'm guessing the next thing I should try is a new separate unencrypted SSD to store the chain... but I have not got round to doing so yet (having another SSD purely for the eth blockchain is a fairly expensive option). Currently my chaindata folder is 358.8GB. Looks like Ubuntu 16.04 is a consistent part of this thread/problem? |
@mtj151 good observation. I'm not ruling out any factors at this point. Is anyone using AWS by any chance? |
I have also noticed that I am unable to send transactions while I am getting the "Synchronisation failed, retrying err="block download cancelled (requested)"" warnings. I sent one transaction fine but then the warnings come up and it wouldn't let me send another transaction (even after the messages stopped and syncing started again). I had to completely restart geth to be able to send the transaction. |
@GeeeCoin I was unable to get a Geth node to stay up to date with chaintip on AWS in any meaningful time without using Provisioned IOPS SSDs on EBS-optimized instances or the |
@10A7 appreciate the data point. If NUC is outperforming a quad core with 8GB in AWS, that's a problem. Amazon may have network latency that hasn't been optimized with the |
Sounds like 10a7 had the same problem with lagging behind the chain tip... good description of the problem. Did NVMe SSD fix the problem?? I'm looking at getting one in the coming weeks to run geth. |
@mtj151 NVMe SSD doesn't seem to matter. I have no trouble keeping SATA SSDs and bcache-fronted magnetic arrays intact and synced I/O wise. If you are synced and "importing new chain segment", it seems to mostly be network issues that cause my nodes to fall behind. Restarting geth often helps to get different peers. Geth sync-after-fast-pivot is also much more reliable for me if I am not behind a NAT, and can forward/open 30303/tcp. |
FWIW I was able to get geth to fully sync by waiting until eth.blockNumber is near the numbers in eth.syncing and then restarting geth. I was able to do this at ~160m states. After restarting geth, it took about 20 min to catch up to the blockchain and now eth.syncing is false and the only output now is 'imported new chain segment' every time a new block is found. |
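For anyone who wants to watch for the moment described above, here is a minimal sketch of how the eth.syncing fields could be summarised. It assumes you have already fetched the value (False, or an object with currentBlock, highestBlock, pulledStates and knownStates) by whatever means you prefer; the function name and the sample numbers are hypothetical.

```python
def sync_progress(syncing):
    """Summarise an eth.syncing result.

    `syncing` is either False (node reports it is synced) or a dict with
    currentBlock, highestBlock, pulledStates and knownStates.
    """
    if syncing is False:
        return {"synced": True, "blocks_behind": 0, "states_pending": 0}
    return {
        "synced": False,
        "blocks_behind": syncing["highestBlock"] - syncing["currentBlock"],
        # knownStates keeps growing during the download, so this is only
        # a lower bound on the work that is left.
        "states_pending": syncing["knownStates"] - syncing["pulledStates"],
    }


# Hypothetical sample values, for illustration only.
sample = {
    "currentBlock": 4305451,
    "highestBlock": 4305854,
    "pulledStates": 2611223,
    "knownStates": 2611400,
}
print(sync_progress(sample))
print(sync_progress(False))
```

Note that a small blocks_behind together with a non-zero states_pending means the block download is done but the state download is not.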
@ The current default mode of sync for Geth is called fast sync. Instead of starting from the genesis block and reprocessing all the transactions that ever occurred (which could take weeks), fast sync downloads the blocks and only verifies the associated proofs-of-work. Downloading all the blocks is a straightforward and fast procedure and will relatively quickly reassemble the entire chain.
Many people falsely assume that because they have the blocks, they are in sync. Unfortunately this is not the case: since no transaction was executed, we do not have any account state available (i.e. balances, nonces, smart contract code and data). These need to be downloaded separately and cross-checked with the latest blocks. This phase is called the state trie download and it actually runs concurrently with the block downloads; alas, it takes a lot longer nowadays than downloading the blocks.
So, what's the state trie? In the Ethereum mainnet, there are a ton of accounts already, which track the balance, nonce, etc. of each user/contract. The accounts themselves are however insufficient to run a node; they need to be cryptographically linked to each block so that nodes can actually verify that the accounts are not tampered with. This cryptographic linking is done by creating a tree data structure above the accounts, each level aggregating the layer below it into an ever smaller layer, until you reach the single root. This gigantic data structure containing all the accounts and the intermediate cryptographic proofs is called the state trie.
Ok, so why does this pose a problem? This trie data structure is an intricate interlink of hundreds of millions of tiny cryptographic proofs (trie nodes). To truly have a synchronized node, you need to download all the account data, as well as all the tiny cryptographic proofs to verify that no one in the network is trying to cheat you. This itself is already a crazy number of data items.
The part where it gets even messier is that this data is constantly morphing: at every block (15s), about 1000 nodes are deleted from this trie and about 2000 new ones are added. This means your node needs to synchronize a dataset that is changing 200 times per second. The worst part is that while you are synchronizing, the network is moving forward, and state that you began to download might disappear while you're downloading, so your node needs to constantly follow the network while trying to gather all the recent data. But until you actually do gather all the data, your local node is not usable since it cannot cryptographically prove anything about any accounts.
If you see that you are 64 blocks behind mainnet, you aren't yet synchronized, not even close. You are just done with the block download phase and still running the state downloads. You can see this yourself via the seemingly endless
Q: The node just hangs on importing state entries?!
A: The node doesn't hang, it just doesn't know how large the state trie is in advance, so it keeps on going and going and going until it discovers and downloads the entire thing. The reason is that a block in Ethereum only contains the state root, a single hash of the root node. When the node begins synchronizing, it knows about exactly 1 node and tries to download it. That node can refer to up to 16 new nodes, so in the next step, we'll know about 16 new nodes and try to download those. As we go along the download, most of the nodes will reference new ones that we didn't know about until then. This is why you might be tempted to think it's stuck on the same numbers. It is not; rather, it's discovering and downloading the trie as it goes along.
Q: I'm stuck at 64 blocks behind mainnet?!
A: As explained above, you are not stuck, just finished with the block download phase and waiting for the state download phase to complete too. This latter phase nowadays takes a lot longer than just getting the blocks.
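The discovery process described in the answers above can be illustrated with a toy simulation. This is only a sketch under simple assumptions: the trie is a made-up, complete 16-ary tree of depth 3 (real state tries are far larger and irregular), and the names sync_trie and toy_children are hypothetical. It shows why the pending count cannot tell you how much is left: new nodes only become known as their parents are downloaded.

```python
from collections import deque


def sync_trie(root, children_of):
    """Download a trie breadth-first, learning about children as we go.

    Returns a list of (processed, pending) pairs, one per downloaded node.
    """
    known = {root}
    pending = deque([root])
    processed = 0
    history = []
    while pending:
        node = pending.popleft()      # "download" one node
        processed += 1
        for child in children_of(node):
            if child not in known:    # only now do we learn the child exists
                known.add(child)
                pending.append(child)
        history.append((processed, len(pending)))
    return history


def toy_children(node):
    """Hypothetical complete 16-ary trie of depth 3; a node is its path tuple."""
    return [node + (i,) for i in range(16)] if len(node) < 3 else []


history = sync_trie((), toy_children)
print(history[0])    # after the root: 1 processed, 16 pending
print(history[1])    # pending grows even though we are making progress
print(history[-1])   # only at the very end does pending reach 0
```

In this toy run the pending count rises for most of the download and only collapses near the end, which is the same shape people report from the state-entry log lines.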
Q: Why does downloading the state take so long, I have good bandwidth?
A: State sync is mostly limited by disk IO, not bandwidth. The state trie in Ethereum contains hundreds of millions of nodes, most of which take the form of a single hash referencing up to 16 other hashes. This is a horrible way to store data on a disk, because there's almost no structure in it, just random numbers referencing even more random numbers. This makes any underlying database weep, as it cannot optimize storing and looking up the data in any meaningful way. Not only is storing the data very suboptimal, but due to the 200 modifications per second and the pruning of past data, we cannot even download it in a properly pre-processed way to make it import faster without the underlying database shuffling it around too much. The end result is that even a fast sync nowadays incurs a huge disk IO cost, which is too much for a mechanical hard drive.
Q: Wait, so I can't run a full node on an HDD?
A: Unfortunately not. Doing a fast sync on an HDD will take more time than you're willing to wait with the current data schema. Even if you do wait it out, an HDD will not be able to keep up with the read/write requirements of transaction processing on mainnet. You should however be able to run a light client on an HDD with minimal impact on system resources. If you wish to run a full node, however, an SSD is your only option. |
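As a quick check of the churn figure quoted in this explanation, here is the arithmetic as a small sketch. The inputs are the approximate numbers given above (about 1000 trie nodes deleted and 2000 added per 15-second block); the function names are hypothetical and the daily growth figure is only what those rough inputs imply.

```python
BLOCK_TIME_S = 15                # approximate block time quoted above
NODES_DELETED_PER_BLOCK = 1000   # approximate, from the explanation above
NODES_ADDED_PER_BLOCK = 2000     # approximate, from the explanation above


def trie_changes_per_second(deleted, added, block_time_s):
    """Total trie node modifications (deletes plus inserts) per second."""
    return (deleted + added) / block_time_s


def net_growth_per_day(deleted, added, block_time_s):
    """Net number of trie nodes gained per day at the given churn."""
    blocks_per_day = 86400 / block_time_s
    return (added - deleted) * blocks_per_day


print(trie_changes_per_second(NODES_DELETED_PER_BLOCK, NODES_ADDED_PER_BLOCK, BLOCK_TIME_S))
print(net_growth_per_day(NODES_DELETED_PER_BLOCK, NODES_ADDED_PER_BLOCK, BLOCK_TIME_S))
```

The first value reproduces the "200 times per second" figure; the second shows that, at those rates, the trie also gains several million nodes a day while you are trying to download it.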
@karalabe Thanks for breaking this down again. We knew most of this about Geth/Eth already, but I'm really surprised as to how suboptimal the state trie system is at being stored to disk; I thought the whole point of building ethereum this way (with modified patricia trees etc.) was to minimize footprint/disk mods, but looks like innovation in storage structures is still needed. |
@karalabe Nice introduction. I understand the fast sync internals better now. |
@karalabe So is there any way of knowing how close you are to being finished syncing? None of the metrics from the |
#16558 If those are actually implemented, you'll at least be able to scrape the number of states from an external reference. |
System information
Geth version (geth version): 1.5.9-stable, Go1.7.4
OS & Version: OSX 10.12.6, MacMini with 4GB RAM (the latest MacMini doesn't support field RAM upgrades anymore), VDSL connection averaging 20-40 Mbit throughput, Ethereum Wallet 0.9.0
Commit hash: (if develop)
Expected behaviour
Fast sync to the current latest block, followed by fast sync auto-disabling.
Actual behaviour
Stalls anywhere from a few thousand to a few hundred blocks short of the current latest block. It tries to catch up, but new blocks arrive faster than fast-sync blocks are added. It never auto-disables fast sync mode.
Steps to reproduce the behaviour
Run geth removedb, then geth --fast --cache=1024. Tried 5 times on that machine over the last weeks.
Fast sync is already my workaround: starting a fresh fast sync from scratch. Before that, I was unsuccessful on that machine trying to sync with existing blockchain data instead; that was also a lost race to catch up with the latest block. The workaround was good until now.
Today even the workaround in fast sync mode (--cache=1024) will not completely load the blockchain anymore. It gets to within a few hundred blocks of the latest block and stalls for hours. By the time it catches up a few hundred blocks, the latest block has moved ahead again. The closer geth gets to the latest block (4173161 at the time of writing), the slower it gets. It does not catch up anymore. I have tried 5 times now over the last weeks, giving up after around 4-5 days each.
Does the machine no longer meet today's minimum hardware requirements, or is this a major bug?
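The "lost race" described above comes down to two rates: how fast the node imports blocks and how fast the network produces them. Here is a minimal sketch of that relationship; the function name and the sample rates are hypothetical and not measured from this machine.

```python
import math


def catch_up_seconds(gap_blocks, import_rate, chain_rate):
    """Seconds needed to close a gap of `gap_blocks`.

    import_rate: blocks per second the node imports.
    chain_rate:  blocks per second the network produces.
    Returns math.inf if the node imports no faster than the chain grows.
    """
    net_rate = import_rate - chain_rate
    if net_rate <= 0:
        return math.inf
    return gap_blocks / net_rate


# Hypothetical rates: importing twice as fast as the chain grows.
print(catch_up_seconds(300, 0.5, 0.25))
# Importing slower than the chain grows: the gap never closes.
print(catch_up_seconds(300, 0.05, 0.25))
```

When the import rate drops to or below the chain's rate, the remaining gap stops shrinking no matter how long the node runs, which is consistent with the behaviour reported here.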
Backtrace
latest block 13 hours ago (!)
I0818 00:15:26.444933 core/blockchain.go:805] imported 148 receipts in 2.775s. #4169952 [e3f556fc… / 36f4d3c9…]
...
latest header chain 50 minutes ago
I0818 12:47:45.107445 core/headerchain.go:342] imported 1 headers in 4.954ms. #4173009 [350d1426… / 350d1426…]
...
currently only importing nothing but state entries
I0818 13:36:41.103101 eth/downloader/downloader.go:966] imported 172 state entries in 10.009s: processed 10010213, pending at least 129361
I0818 13:36:41.103131 eth/downloader/downloader.go:966] imported 384 state entries in 783.519ms: processed 10010597, pending at least 129361
I0818 13:36:41.103154 eth/downloader/downloader.go:966] imported 381 state entries in 6.963s: processed 10010978, pending at least 129361
I0818 13:36:41.103167 eth/downloader/downloader.go:966] imported 25 state entries in 87.654ms: processed 10011003, pending at least 129360
I0818 13:36:46.014244 eth/downloader/downloader.go:966] imported 384 state entries in 2.482s: processed 10011387, pending at least 127584
I0818 13:36:49.074483 eth/downloader/downloader.go:966] imported 381 state entries in 7.082s: processed 10011768, pending at least 127105
I0818 13:36:49.074553 eth/downloader/downloader.go:966] imported 384 state entries in 7.971s: processed 10012152, pending at least 127105
I0818 13:36:49.074574 eth/downloader/downloader.go:966] imported 384 state entries in 3.772s: processed 10012536, pending at least 127105
I0818 13:36:49.074603 eth/downloader/downloader.go:966] imported 162 state entries in 5.822s: processed 10012698, pending at least 127105
I0818 13:36:49.074622 eth/downloader/downloader.go:966] imported 25 state entries in 4.050s: processed 10012723, pending at least 127105
I0818 13:36:49.074639 eth/downloader/downloader.go:966] imported 381 state entries in 3.060s: processed 10013104, pending at least 127105
I0818 13:36:49.074742 eth/downloader/downloader.go:966] imported 85 state entries in 7.117s: processed 10013189, pending at least 127105
I0818 13:36:49.074765 eth/downloader/downloader.go:966] imported 375 state entries in 2.219s: processed 10013564, pending at least 127105
I0818 13:36:49.074782 eth/downloader/downloader.go:966] imported 87 state entries in 3.915s: processed 10013651, pending at least 127105
I0818 13:36:49.074795 eth/downloader/downloader.go:966] imported 23 state entries in 271.734ms: processed 10013674, pending at least 127104
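To pull numbers out of log lines like the ones above, a small parser can help. This is a sketch that assumes the exact line shape shown in this backtrace (the 1.5.x format with durations in s or ms); newer geth versions log a different format, and the names STATE_LINE and parse_state_line are hypothetical.

```python
import re

# Matches e.g. "imported 172 state entries in 10.009s: processed 10010213,
# pending at least 129361" as printed in the backtrace above.
STATE_LINE = re.compile(
    r"imported (\d+) state entries in ([\d.]+)(ms|s): "
    r"processed (\d+), pending at least (\d+)"
)


def parse_state_line(line):
    """Return count, seconds, processed and pending from one log line, or None."""
    match = STATE_LINE.search(line)
    if match is None:
        return None
    count, elapsed, unit, processed, pending = match.groups()
    seconds = float(elapsed) / 1000 if unit == "ms" else float(elapsed)
    return {
        "count": int(count),
        "seconds": seconds,
        "processed": int(processed),
        "pending": int(pending),
    }


line = ("I0818 13:36:41.103101 eth/downloader/downloader.go:966] imported 172 "
        "state entries in 10.009s: processed 10010213, pending at least 129361")
print(parse_state_line(line))
```

Several lines share the same timestamp, so the per-line durations probably overlap; comparing the processed counter between two timestamps is likely a safer way to estimate throughput than summing the durations.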