bug: zebrad will not reconnect after an internet connection failure and restore #7772
Comments
Thank you for reporting! @ZcashFoundation/zebra-team Let's look into this next sprint.
I think the underlying issue here is our design decision to deal with two competing priorities: recovering connections to peers that failed temporarily, and rate-limiting connection attempts to peers that keep failing.
We currently use a fixed time limit after which we delete/ignore failed peers. But an exponential backoff would achieve both goals; it's the typical solution to these kinds of rate-limiting/recovery issues.
Also, it should be a randomised exponential backoff. If it's not randomised, and a significant part of the network goes down at the same time, there could be a load spike when it comes back up.
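To sketch the idea (this is only an illustration, not Zebra's actual code; the base and maximum delays below are made-up constants):

```rust
use std::time::Duration;
use rand::Rng;

/// Made-up values for illustration; the real constants would differ.
const BASE_RETRY_DELAY: Duration = Duration::from_secs(30);
const MAX_RETRY_DELAY: Duration = Duration::from_secs(60 * 60);

/// Delay before retrying a peer that has failed `failure_count`
/// consecutive connection attempts.
fn retry_delay(failure_count: u32) -> Duration {
    // Exponential growth, capped at the maximum delay.
    let capped = BASE_RETRY_DELAY
        .saturating_mul(2u32.saturating_pow(failure_count))
        .min(MAX_RETRY_DELAY);

    // Randomisation: pick uniformly from [capped / 2, capped], so peers that
    // failed at the same time (e.g. during a network outage) don't all
    // reconnect at the same instant.
    let max_millis = capped.as_millis() as u64;
    let jittered = rand::thread_rng().gen_range(max_millis / 2..=max_millis);
    Duration::from_millis(jittered)
}

fn main() {
    for failures in 0..6 {
        println!("{failures} failures -> retry in {:?}", retry_delay(failures));
    }
}
```

The jitter range and cap here are arbitrary; the point is just that repeated failures back off exponentially while the randomisation spreads reconnection attempts out over time.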
@arya2 let's try to work out the scope of this bug fix before you start work? And a rough design?

Looking at the code, I think there are two related bugs here. The first fix is changing the peer reconnection limit constant to a randomised exponential backoff, which needs around 4 small code changes. The second bug is that peers without last seen times never get attempted; that is a single-line fix. There will be other trivial code changes, documentation changes, and tests, but I am expecting this to be a small PR.

What do you think? Is there anything else that is required to fix this issue?
This probably didn't affect all of the peer candidates, and there are valid candidates (I just checked in the peers file).
I think the specific problem should be a one-line fix. But we should split out any related issues.
Actually I think this is the correct behaviour, because it's "failed peers without a last seen time". If a peer address has never had a connection from us, and we got its address from a seeder, Version message, inbound connection IP, or the peers file, it's probably not a valid peer any more. Let's document this so we don't try to fix it again 😂
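As a purely hypothetical sketch of that rule (the type and field names below are invented, not Zebra's real address book code), the eligibility check being described could be written roughly as:

```rust
use std::time::Instant;

/// Invented candidate-peer state, for illustration only.
struct CandidatePeer {
    /// Set once we have ever successfully connected to this peer.
    last_connection: Option<Instant>,
    /// True if the most recent connection attempt failed.
    failed: bool,
}

/// A failed peer that we have never actually connected to (its address came
/// only from a seeder, `version` message, inbound connection IP, or the
/// peers file) is not retried, because it is probably not a valid peer.
fn should_retry(peer: &CandidatePeer) -> bool {
    !(peer.failed && peer.last_connection.is_none())
}

fn main() {
    let never_connected_failed = CandidatePeer { last_connection: None, failed: true };
    assert!(!should_retry(&never_connected_failed));
}
```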
There's some unusual behaviour in the logs, so I want to be careful we're not making too many assumptions. If all the peers are considered invalid for gossip, the peer cache file doesn't get updated. So we can't assume too much based on that file either.
@autotunafish how long was the network connection down for? More than 2 minutes, more than 3 hours, and more than 3 days are the important times where Zebra's behaviour changes. @arya2 I think the
How do you know there are valid candidates? There should be at least 3 running tasks that periodically create demand (peer crawler, block syncer, mempool queries). Maybe having no connected peers cancels the demand signal before it gets sent on the channel, or after it is picked up from the channel? Do you know if the timer crawl is adding regular demand? Can you see it in the logs? (Reading the code is helpful but not enough. There's a hang or task dependency happening here, so async code could be waiting for other async code at any await point.)
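To make the channel question concrete, here is a minimal sketch of the failure mode being asked about (entirely made-up code, not Zebra's; the channel, tasks, and `ready_peers` check are all assumptions): a timer that signals demand with `try_send`, and a consumer that discards demand while no peers are ready, so the demand never becomes a connection attempt.

```rust
use tokio::sync::mpsc;
use tokio::time::{interval, Duration};

#[tokio::main]
async fn main() {
    let (demand_tx, mut demand_rx) = mpsc::channel::<()>(1);

    // Timer task: periodically create demand, like a crawl timer would.
    tokio::spawn(async move {
        let mut tick = interval(Duration::from_secs(1));
        loop {
            tick.tick().await;
            // If the channel is already full, `try_send` returns an error and
            // this round of demand is silently dropped.
            let _ = demand_tx.try_send(());
        }
    });

    // Consumer: if it discards demand whenever no peers are ready, demand
    // never turns into connection attempts for the whole outage.
    let ready_peers: usize = 0;
    while demand_rx.recv().await.is_some() {
        if ready_peers == 0 {
            // Demand is dropped here; nothing retries the failed peers.
            continue;
        }
        // ...otherwise we would dial a candidate peer here.
    }
}
```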
Here's what I see in the logs:
At the start of the last 25 minutes (where there isn't any peer activity logged), there is this log where all 60 peers have failed:
To be available for connection, these peers need to:
It seems like these conditions should be satisfied, since we have logs of new blocks from peers 26 minutes ago. But it would be great to have more confirmation.
@arya2 can you make a list of all the issues we've mentioned so far, so we can confirm them and prioritise them?
I checked that it reconnects to peers after losing network access for a few minutes.
Hey team! Please add your planning poker estimate with Zenhub @arya2 @oxarbitrage @teor2345 @upbqdn
@mpguerra I don't think this ticket has a defined scope yet?
I was waiting for this list of issues, then we'd work out which ones need to be fixed to resolve this bug. After we have a list of fixes I can do an estimate.
I was also waiting for an answer to this question to help prioritise the issues.
autotunafish says that the network was down for up to 10 minutes, and there were no clock changes. So this seems like a high- or medium-priority usability/reliability bug. And Arya's explanation seems to match the behaviour in the logs. Let's split anything about the 3 hour or 3 day limits into other tickets?
What happened?
I expected to see this happen: zebrad re-establishes a working connection after the internet connection is re-established.
Instead, this happened: zebrad continuously warns that 'chain updates have stalled, state height has not increased for # minutes'. Restarting the node fixes the issue.
What were you doing when the issue happened?
Disconnecting the router that the device is attached to from the internet and then reconnecting it (I'm filing this issue from the device in question while zebrad is failing to reconnect).
Zebra logs
https://gist.github.com/autotunafish/30959aa159bf301c38af8f4480a206e2
peers
https://gist.github.com/autotunafish/79fdcdfcb20ce1b4eb6a27f24885babd
Zebra Version
1.3.0
Which operating systems does the issue happen on?
OS details
Linux john-Satellite-L775D 5.4.0-164-generic #181-Ubuntu SMP Fri Sep 1 13:41:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Additional information
No response