PoA network, all the sealers are waiting for each other after 2 months running, possible deadlock? #18402
Comments
I found that I have a lot of "block lost" warnings on each node. Could this be the problem? The chain was running with those warnings without any issues anyway. By the way, could it be caused by a bad connection between the nodes? I'm using DigitalOcean droplets. NOTE: if I check eth.getBlockNumber I get 488676 or 488675 depending on the sealer.
We experienced a similar deadlock on our fork, and the cause was due to out-of-turn blocks all being difficulty 1. You can compare hashes and recent signers on each node to confirm if your network is deadlocked in the same way.
Thanks for the response. Can you give me an idea of how to "compare hashes and recent signers on each node to confirm if your network is deadlocked in the same way"? Thanks
By getting the last 2 blocks from each node, you should be able to see exactly why they are stuck based on their view of the world. They all think that they have signed too recently, so they must disagree on what the last few blocks are supposed to be, so you'll see different hashes and signers for blocks with the same number (and difficulty!).
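For reference, a minimal sketch of that comparison, assuming each sealer exposes an HTTP RPC endpoint (the URLs below are placeholders): it prints the number, hash, parent hash and difficulty of the head block and its parent on every node. Recovering the actual signer additionally requires ecrecovering the seal in the header's extraData, which is omitted here.

```go
// Sketch: fetch the last two headers from each sealer and compare them.
package main

import (
	"context"
	"fmt"
	"log"
	"math/big"

	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	endpoints := []string{ // hypothetical sealer RPC endpoints
		"http://sealer1:8545",
		"http://sealer2:8545",
	}
	for _, url := range endpoints {
		client, err := ethclient.Dial(url)
		if err != nil {
			log.Printf("%s: dial failed: %v", url, err)
			continue
		}
		head, err := client.HeaderByNumber(context.Background(), nil) // nil = latest
		if err != nil {
			log.Printf("%s: head fetch failed: %v", url, err)
			continue
		}
		parentNum := new(big.Int).Sub(head.Number, big.NewInt(1))
		parent, err := client.HeaderByNumber(context.Background(), parentNum)
		if err != nil {
			log.Printf("%s: parent fetch failed: %v", url, err)
			continue
		}
		// Differing hashes at the same height (and same difficulty) across
		// nodes indicate the kind of split discussed in this thread.
		fmt.Printf("%s head   #%v hash=%s parent=%s diff=%v\n",
			url, head.Number, head.Hash().Hex(), head.ParentHash.Hex(), head.Difficulty)
		fmt.Printf("%s parent #%v hash=%s parent=%s diff=%v\n",
			url, parent.Number, parent.Hash().Hex(), parent.ParentHash.Hex(), parent.Difficulty)
	}
}
```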
Good idea!! Sealer 1: last block 488676, last - 1 = 488675. Sealer 2: last block is 488675. The second node didn't reach block 488676. The hashes of block 488675 are different, and the difficulties are different too (1 and 2). For other blocks, like block 8, the hashes are equal and the difficulty is 2 for both. It seems like all the blocks have difficulty 2 except that conflicting one. Did you find any logical explanation for that? By the way, I don't know why difficulty = 2, since the genesis file uses 0x1. Thoughts?
The in-turn signer always signs with difficulty 2. Out-of-turn signers sign with difficulty 1. This is built into the clique protocol, and it is the primary cause of this problem in the first place. It looks like you have 6 signers. You will have to check them all to make sense of this.
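For context, the rule described here can be sketched from the EIP-225 spec (this mirrors the spec, not the geth source): the in-turn signer for a height is the one whose index among the sorted signers equals the block number modulo the signer count, and only that signer may use difficulty 2.

```go
// Sketch of the clique difficulty rule from EIP-225: the in-turn signer
// signs with difficulty 2, everyone else with difficulty 1, regardless of
// what the genesis file declares.
package main

import "fmt"

const (
	diffInTurn = 2 // block signed by the in-turn signer
	diffNoTurn = 1 // block signed out of turn
)

// calcDifficulty returns 2 or 1 depending on whether signerIndex is the
// in-turn signer for blockNumber among signerCount sorted signers.
func calcDifficulty(blockNumber, signerIndex, signerCount uint64) uint64 {
	if blockNumber%signerCount == signerIndex {
		return diffInTurn
	}
	return diffNoTurn
}

func main() {
	// With 6 signers, signer #3 is in turn at block 9 (9 % 6 == 3).
	fmt.Println(calcDifficulty(9, 3, 6)) // 2
	fmt.Println(calcDifficulty(9, 0, 6)) // 1
}
```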
So, if I found two signers (out of my 6) with the same difficulty and different hashes, the deadlock would make sense, right? The same block with different difficulty and different hash doesn't prove anything? I have deleted the chaindata of the other node with the same last block 488675 #fail
Not necessarily. Those kinds of ambiguous splits happen very frequently with clique and would normally sort themselves out. Are you still trying to recover this chain?
If it is not necessarily so, and it normally sorts itself out, then maybe the deadlock theory isn't valid. What did you mean by "It looks like you have 6 signers. You will have to check them all to make sense of this."? About the chain: I basically wanted to know what happened. I don't know if I can provide any kind of logs, since the sealers just stopped to wait for each other and I don't have any other information. Also, hitting this scenario in a production environment sucks, since I can't continue mining, and there is nothing in go-ethereum that guarantees this will not happen again. So, just to make things clearer: if block 488675 has a different difficulty and a different hash, doesn't that prove there was an issue? Is it normal to have different hashes when comparing in-turn with out-of-turn blocks, then?
Resyncing the signers that you deleted may produce a different distributed state which doesn't deadlock. Or it could deadlock again right away (or at any point in the future). Making fundamental protocol changes to clique like we did for GoChain is necessary to avoid the possibility completely, but can't be applied to an existing chain (without coding in a custom hard fork). You could start a new chain with GoChain instead.
They all have different views of the chain. You can't be sure why each one was stuck without looking at them all individually.
Ok, but what am I looking for? Right now I'm deleting the chain data for all the nodes except 1 and resyncing the rest of them (5 signers) from that node. About this comment:
If I see two in-turn or two out-of-turn blocks with the same difficulty and different hashes, will that confirm that they think they have signed too recently?
If they logged that they signed too recently then you can trust that they did. Inspecting the recent blocks would just give you a more complete picture of what exactly happened.
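One way to inspect each node's view, assuming the clique namespace is enabled on each node's RPC endpoint (the URLs are placeholders), is to dump the clique snapshot: its "recents" field maps recent block numbers to the signer that node credits them to, and "signers" lists the authorized signers it knows about.

```go
// Sketch: print each node's clique snapshot at its own head block.
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	for _, url := range []string{"http://sealer1:8545", "http://sealer2:8545"} { // placeholders
		client, err := rpc.Dial(url)
		if err != nil {
			log.Printf("%s: %v", url, err)
			continue
		}
		var snap json.RawMessage
		// "latest" asks for the snapshot at whatever block this node considers its head.
		if err := client.Call(&snap, "clique_getSnapshot", "latest"); err != nil {
			log.Printf("%s: %v", url, err)
			continue
		}
		fmt.Printf("%s snapshot: %s\n", url, snap)
	}
}
```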
Well, I deleted all the chain data for the 5 sealers and synced them from 1. It started to work again, but there is a sealer that seems to have connectivity issues or something. The sealer starts with 6 peers, then goes to 4, 3, 2, then back to 4, 6, etc. That's why I suppose the blocks are being lost, and probably why the synchronisation-failed warning is thrown, since it is always the same node. Any ideas why this is happening? Connectivity issues because they are separate droplets? Any way to troubleshoot this? Thanks
I don't think the peer count is related to lost blocks, and neither peers nor lost blocks are related to the logical deadlock caused by the same-difficulty ambiguity. Regardless, you can use static/trusted enodes to add the peers automatically.
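As a sketch of that suggestion (the RPC endpoint and enode URLs below are placeholders; the same effect can be achieved by listing the enode URLs in a static-nodes.json file in the node's data directory), the peers can be pinned through the admin API:

```go
// Sketch: register peers via admin_addPeer instead of adding them by hand
// in the console on every restart. Requires the admin API to be enabled.
package main

import (
	"fmt"
	"log"

	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	client, err := rpc.Dial("http://sealer1:8545") // hypothetical local sealer endpoint
	if err != nil {
		log.Fatal(err)
	}
	peers := []string{ // placeholder enode URLs of the other sealers
		"enode://<pubkey-of-sealer-2>@10.0.0.2:30303",
		"enode://<pubkey-of-sealer-3>@10.0.0.3:30303",
	}
	for _, enode := range peers {
		var added bool
		if err := client.Call(&added, "admin_addPeer", enode); err != nil {
			log.Printf("addPeer %s: %v", enode, err)
			continue
		}
		fmt.Println("added:", enode, added)
	}
}
```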
I add the nodes manually, but it is weird that one sealer always has connectivity issues with the rest of the peers. I will try the static/trusted nodes. I will put the lost blocks in a separate issue, but I would like to have a response from the geth team about the initial problem, because it seems like I can run into another deadlock again. Thanks @jmank88. PS: Do you think that the block sealing time can be an issue here? I'm using 10 secs.
'Lost blocks' are just blocks that were signed but didn't make the canonical chain. These happen constantly in clique, because most (~1/2) of the signers are eligible to sign at any given time, but only one block is chosen (usually the in-turn signer, with difficulty 2) - all of the other out-of-turn candidates become 'lost blocks'.
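As a worked example of that eligibility rule (this follows EIP-225's SIGNER_LIMIT, not the geth source): a signer must wait floor(N/2) + 1 blocks between its own blocks, so at any height only N - floor(N/2) signers are allowed to sign, and all but one of those candidate blocks end up "lost".

```go
// Worked example: how many signers are eligible to sign at any given height.
package main

import "fmt"

// eligibleSigners returns N - floor(N/2): the signers not blocked by the
// "signed recently, must wait for others" rule on a sufficiently long chain.
func eligibleSigners(n int) int {
	return n - n/2
}

func main() {
	for _, n := range []int{2, 3, 5, 6} {
		fmt.Printf("%d signers -> %d eligible at any height\n", n, eligibleSigners(n))
	}
	// 2 -> 1, 3 -> 2, 5 -> 3, 6 -> 3: roughly half the signers can sign,
	// but only one block per height makes the canonical chain.
}
```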
Faster times might increase the chances of bad luck or produce more opportunities for it to go wrong, but the fundamental problem always exists.
Right, I understand. So, nothing to worry about in a PoA network then? About the time, yeah, completely agree. Thanks a lot!
I'm not sure what you mean. IMHO the ambiguous difficulty issues are absolutely fatal flaws - the one affecting the protocol itself is much more severe, but the client changes I linked addressed deadlocks as well.
It's also worth noting that increasing the number of signers may reduce the chance of deadlock, as may having an odd number of signers rather than an even one.
Yes, sure. I mean, I didn't know about that, but it is really good information and I really appreciate it. I was talking about the lost-block warning; your explanation makes sense for PoA. About the number of signers, yes, I have read about that, it makes sense. I have also implemented a PoC with just 2 sealers and, maybe I'm lucky, but in 700k blocks I did not experience this issue. Right now I'm using an odd number.
Limiting to just 2 signers is a special case with no ambiguous same-difficulty blocks.
After removing 1 node and resyncing from the data of 1 of the nodes, I was running the network with 5 sealers without issues. Summary:
After 1 day it got stuck again, but now in a weirder situation:
Last block: 503076
- Sealer 1
- Sealer 2
- Sealer 4 (out of turn, with a different hash and parent hash)
- Sealer 5 (out of turn, 3rd side chain)
- Sealer 6 (same hash as the side chain but different parent)
The number of signers is 5. Each node is paired with 4 signers and 1 standard node.
Last block - 1: 503075
- Sealer 1 (out of turn)
- Sealer 2 (out of turn, same hash)
- Sealer 4 (out of turn, different hash, same parent)
- Sealer 5 (in turn)
- Sealer 6 (in turn too)
You can remove the stack traces, they are not necessary. This looks like a logical deadlock again. Can you double-check your screenshots?
Has this problem been solved? I have the exact same issue with 2 nodes, both sealers (limited-resource setup). Even for contract deployment I get a hash and contract address, but the contract is not present in the chain. A restart of the chain fixes it, though. After that, all transactions time out. Both nodes are connected.
Are you sure that it is the same problem? Because a restart should not fix it. The problem behind this issue creates a fork of the chain.
A restart only makes the initially deployed contract available. Without a restart I get the error "...is chain synced / contract deployed correctly?". I haven't found a fix for further transactions timing out.
Can we not run a PoA network with 1 sealer? The logs show "Signed recently. Must wait for others".
Hello, in my private Clique network with 4 nodes (A, B, C, D) I noticed a fork in the chain with block period 1. It happens sometimes with block periods 1 and 2. The fork happened at block height 1500, for example. Nodes A and D have similar chain data, while nodes B and C have similar data (fork occurrence). At block 1500, I noticed these differences between the two chains:
1. The block hashes are different.
2. The block on one fork is an uncle block, while the block on the other fork has 5000 txs included.
3. Both blocks have the same difficulty, 2, which means both were mined in turn (complication).
4. Another complication is that it was the same sealer who sealed both blocks.
This results in a fork of the network and eventually a stall, which cannot undergo any reorg in this deadlock situation. In previous comments I noticed that there were at least different difficulties and different sealers at the same block height between the forks. Please can someone let me know if you have faced this issue, or give a logical explanation of it?
Do we have a solution for this issue? We have also encountered the same problem on our network, which had 5 signers and worked well for almost 2 months. It suddenly started to show the message:
We tried to start mining by using the miner.start() function on all the miners/signers, but they did not start mining on the network. 3 of the nodes showed a response like: INFO [09-05|08:53:23.471] Commit new mining work number=7961999 sealhash="d93ccf…cdb147" uncles=0 txs=0 gas=0 fees=0 elapsed="94.336µs" and the other 2 showed the same response with number=7961998. The strange thing was that the transactions showing in the txpool were different. Can anyone suggest what I should do so that all the nodes start mining again? I've tried a few steps and solutions but they did not help.
Just came across this again. Here's Peter's comment on that matter: ethereum/EIPs#2181. We drafted EIPs 218{1,2,3} after EthCapeTown for consideration.
It seems the Rinkeby network stopped working for more than an hour three times in only one month:
Are the Rinkeby downtimes related to this issue?
To reproduce deadlock:
With such a configuration, you should have 2-3 deadlocks each hour. |
We experienced such deadlocks on IDChain and solved the issue by running a deadlock-resolver script on all sealer nodes; it monitors the node and, if the chain has stopped, uses
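A minimal sketch of the kind of deadlock resolver described here; the rollback call used below (debug_setHead) and the polling thresholds are assumptions for illustration, not the actual IDChain script. It polls the head block and, if the node has not advanced for a while, rewinds a few blocks so the sealers can re-converge on a single fork. It requires the debug API to be enabled on the local node.

```go
// Sketch of a watchdog that rewinds a stalled sealer (assumed mechanism).
package main

import (
	"fmt"
	"log"
	"strconv"
	"time"

	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	client, err := rpc.Dial("http://localhost:8545") // local sealer endpoint (placeholder)
	if err != nil {
		log.Fatal(err)
	}
	var lastSeen uint64
	stalledSince := time.Now()
	for {
		var hexNum string
		if err := client.Call(&hexNum, "eth_blockNumber"); err != nil {
			log.Println("poll failed:", err)
		} else if num, err := strconv.ParseUint(hexNum[2:], 16, 64); err == nil {
			switch {
			case num != lastSeen:
				// Chain is advancing: remember the new head and reset the timer.
				lastSeen, stalledSince = num, time.Now()
			case time.Since(stalledSince) > 2*time.Minute && num > 3:
				// Stalled: rewind a few blocks past the disputed head
				// (both the timeout and the depth are arbitrary choices).
				if err := client.Call(nil, "debug_setHead", fmt.Sprintf("0x%x", num-3)); err != nil {
					log.Println("debug_setHead failed:", err)
				}
				stalledSince = time.Now()
			}
		}
		time.Sleep(15 * time.Second)
	}
}
```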
But in this case, the last blocks disappear, right? For example, some transactions can be sent through MetaMask, MetaMask will show that the transaction has been completed, and then the block with this transaction can disappear?
Transactions should not go away.
I have the same (or a similar?) issue after migrating from ethash to clique. My scenario is 3 nodes, 7 seconds to mine a new block: get
However, I haven't yet understood why a deadlock could happen in this case. A block has been mined and shared with the rest of the chain. All the other nodes accept it and start executing the next block's calculations on top of it. In case of a race condition, the algorithm chooses the block based on the longest chain length, time of mining, order in the queue, etc. Where is the reason for a deadlock? @jmank88, it seems you have the most experience with this issue and workarounds for it. May I ask you to share more information about this, please?
I haven't dealt with this in a few years, but the general problem was that there is no fully deterministic choice in the case of same length - the tie-breaker was whichever block arrived first, which is subjective due to network latency/partitions/etc. and not representative of when blocks were created. The tie-breaker logic can be patched in the implementation as a workaround to avoid upgrading the difficulty protocol itself.
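As an illustration of what a deterministic tie-breaker could look like (this is not the GoChain or go-ethereum code, just a sketch of the idea): when two chains tie on total difficulty and height, compare something every node evaluates identically, such as the block hash, instead of arrival order.

```go
// Sketch: a deterministic fork-choice tie-break that does not depend on
// which block happened to arrive first at each node.
package main

import (
	"bytes"
	"fmt"

	"github.com/ethereum/go-ethereum/common"
)

// preferNewHead reports whether a candidate head should replace the current
// one when both chains have equal total difficulty and equal height.
func preferNewHead(current, candidate common.Hash) bool {
	// Lower hash wins: every node evaluates this identically, so ties no
	// longer depend on network latency or partitions.
	return bytes.Compare(candidate.Bytes(), current.Bytes()) < 0
}

func main() {
	a := common.HexToHash("0x01")
	b := common.HexToHash("0x02")
	fmt.Println(preferNewHead(b, a)) // true: candidate 0x...01 < current 0x...02
	fmt.Println(preferNewHead(a, b)) // false
}
```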
@jmank88, thank you for more details regarding this issue!
This was a long-standing issue, but with the drop of Clique, we can elegantly close it without addressing it.
This was undoubtedly the longest-standing issue of my career.
System information
My current version is:
Expected behaviour
Keep signing normally.
Actual behaviour
I was running a go-ethereum private network with 6 sealers.
Each sealer is run by:
The blockchain was running well for about 1-2 months.
Today I found that all the nodes were having issues. Each node was emitting the message "Signed recently, must wait for others".
I checked the logs and found this message every hour, with no more information; the nodes were not mining:
Experiencing the same issue with 6 sealers, I restarted each node, but now I get stuck in
The first thing that is weird is that some nodes are stuck on 488677 and others are on 488676; this behaviour was reported in issue #16406, the same as for user @lyhbarry.
Example:
Signer 1
Signer 2
Note that there are no votes pending.
So, right now, I shut down and restarted each node, and I have found that:
So synchronisation fails, but I also can't start signing again because each node is stuck waiting for the others. Does that mean the network is useless?
The comment by @tudyzhb on that issue mentions that:
After this problem, I took a look at the logs; each signer has these error messages:
Synchronisation failed, dropping peer peer=7875a002affc775b err="retrieved hash chain is invalid"
I also see some:
INFO [01-02|16:58:10.902] 😱 block lost number=488205 hash=1fb1c5…a41a42
This hash chain error was just a warning, so the nodes kept mining until the 2nd of January; then I saw this on each of the 6 nodes:
I saw that there are a lot of issues about this error; the most similar is the one I posted here, but it is unresolved.
Most of the workarounds in those issues seem to be a restart, but in this case the chain seems to be in an inconsistent state and the nodes are always waiting for each other.
So,
5. Will upgrading from 1.8.17 to 1.8.20 solve this?
These are other related issues:
#16444 (same issue, but I don't have votes pending in my snapshot)
#14381
#16825
#16406