
bug: zebrad will not reconnect after an internet connection failure and restore #7772

Closed
1 of 4 tasks
autotunafish opened this issue Oct 18, 2023 · 10 comments · Fixed by #7853 or #7869
Assignees: arya2
Labels: A-network (Area: Network protocol updates or fixes), C-bug (Category: This is a bug), I-hang (A Zebra component stops responding to requests), I-usability (Zebra is hard to understand or use)

Comments

@autotunafish

What happened?

I expected to see this happen: zebrad re-establishes a working connection after the internet connection is re-established.

Instead, this happened: zebrad continuously warns that 'chain updates have stalled, state height has not increased for # minutes'. Restarting the node fixes the issue.

What were you doing when the issue happened?

Disconnecting the router attached to the device from the internet and then reconnecting it (I'm filing this issue from the device in question while zebrad is failing to reconnect).

Zebra logs

https://gist.github.com/autotunafish/30959aa159bf301c38af8f4480a206e2

Peers
https://gist.github.com/autotunafish/79fdcdfcb20ce1b4eb6a27f24885babd

Zebra Version

1.3.0

Which operating systems does the issue happen on?

  • Linux
  • macOS
  • Windows
  • Other OS

OS details

Linux john-Satellite-L775D 5.4.0-164-generic #181-Ubuntu SMP Fri Sep 1 13:41:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Additional information

No response

@autotunafish added the C-bug (Category: This is a bug) and S-needs-triage (Status: A bug report needs triage) labels on Oct 18, 2023
@mpguerra added this to the Zebra project on Oct 18, 2023
The github-project-automation bot moved this to 🆕 New in Zebra on Oct 18, 2023
@mpguerra (Contributor)

Thank you for reporting!

@ZcashFoundation/zebra-team Let's look into this next sprint

@teor2345 (Contributor) commented Oct 19, 2023

I think the underlying issue here is our design decision for dealing with two competing priorities: rate-limiting connection attempts to failed peers, and recovering quickly once connectivity comes back.

We currently use a fixed time limit after which we delete/ignore failed peers. But an exponential backoff would achieve both goals; it's the typical solution to these kinds of rate-limiting / recovery issues.

@teor2345 (Contributor)

Also, it should be a randomised exponential backoff. If it's not randomised, and a significant part of the network goes down at the same time, there could be a load spike when it comes back up.
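
For illustration, here is a minimal sketch of what a randomised exponential backoff for reconnection attempts could look like. The constants, the function name, and the use of the rand 0.8 API are assumptions for the sketch, not Zebra's actual implementation.

```rust
use std::time::Duration;

use rand::Rng; // assumes rand 0.8's range-based gen_range API

/// Shortest delay before retrying a failed peer (illustrative value).
const BASE_RETRY_DELAY: Duration = Duration::from_secs(60);
/// Longest delay before retrying a failed peer (illustrative value).
const MAX_RETRY_DELAY: Duration = Duration::from_secs(3 * 60 * 60);

/// Delay before the next connection attempt to a failed peer: it doubles
/// with each consecutive failure, is capped at MAX_RETRY_DELAY, and is
/// randomised so peers across the network don't retry in lockstep.
fn retry_delay(consecutive_failures: u32) -> Duration {
    let capped = BASE_RETRY_DELAY
        .checked_mul(2u32.saturating_pow(consecutive_failures))
        .unwrap_or(MAX_RETRY_DELAY)
        .min(MAX_RETRY_DELAY);

    // Pick a random delay between 50% and 100% of the capped value.
    let cap_ms = capped.as_millis() as u64;
    let jitter_ms = rand::thread_rng().gen_range(cap_ms / 2..=cap_ms);
    Duration::from_millis(jitter_ms)
}

fn main() {
    for failures in 0..5 {
        println!("failures={failures} -> retry in {:?}", retry_delay(failures));
    }
}
```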

@arya2 self-assigned this on Oct 23, 2023
@teor2345 (Contributor)

@arya2 let's try to work out the scope of this bug fix before you start work? And a rough design?

In some previous zebra-network changes, the scope has grown dramatically from the initial ticket, and in the retrospective we just said we wanted to avoid that.

Looking at the code, I think there are two related bugs here. The first is that the peer reconnection limit is a fixed constant; changing it to a randomised exponential backoff needs around 4 small code changes.

The second bug is that peers without last seen times never get attempted. This is a single line fix.

There will be other trivial code changes, documentation changes, and tests. But I am expecting this to be a small PR.

What do you think? Is there anything else that is required to fix this issue?

@arya2 (Contributor) commented Oct 24, 2023

The second bug is that peers without last seen times never get attempted. This is a single line fix.

This probably didn't affect all of the peer candidates, and MAX_RECENT_PEER_AGE from last_seen_is_recent() is 3 days.

There are valid candidates (I just checked in the TimerCrawl match arm), but it never gets to DemandHandshakeOrCrawl; in these logs it seems it's always calling crawl from the TimerCrawl match arm.

What do you think? Is there anything else that is required to fix this issue?

I think the specific problem should be a one-line fix, maybe a missing MorePeers signal.

But I think we should split out any related issues.
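
For context, here is a rough sketch of what a last_seen_is_recent-style check with a 3-day MAX_RECENT_PEER_AGE could look like. The struct, field names, and edge-case handling are illustrative assumptions, not Zebra's actual address book code.

```rust
use std::time::{Duration, SystemTime};

/// Oldest "last seen" time that still counts as recent (3 days, matching the
/// constant mentioned above; the surrounding code is illustrative).
const MAX_RECENT_PEER_AGE: Duration = Duration::from_secs(3 * 24 * 60 * 60);

/// A simplified stand-in for a peer address book entry.
struct PeerEntry {
    last_seen: Option<SystemTime>,
}

impl PeerEntry {
    /// Returns true if the peer was seen within MAX_RECENT_PEER_AGE.
    /// Peers with no last seen time are never considered recent, matching the
    /// "failed peers without a last seen time are skipped" behaviour discussed below.
    fn last_seen_is_recent(&self, now: SystemTime) -> bool {
        match self.last_seen {
            // A last seen time in the future is treated as recent here.
            Some(seen) => now
                .duration_since(seen)
                .map(|age| age <= MAX_RECENT_PEER_AGE)
                .unwrap_or(true),
            None => false,
        }
    }
}

fn main() {
    let now = SystemTime::now();
    let fresh = PeerEntry { last_seen: Some(now - Duration::from_secs(60 * 60)) };
    let stale = PeerEntry { last_seen: Some(now - Duration::from_secs(4 * 24 * 60 * 60)) };
    let unknown = PeerEntry { last_seen: None };

    assert!(fresh.last_seen_is_recent(now));
    assert!(!stale.last_seen_is_recent(now));
    assert!(!unknown.last_seen_is_recent(now));
}
```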

@teor2345 (Contributor) commented Oct 24, 2023

The second bug is that peers without last seen times never get attempted. This is a single line fix.

Actually I think this is the correct behaviour, because it's "failed peers without a last seen time". If a peer address has never had a connection from us, and we got its address from a seeder, Version message, inbound connection IP, or the peers file, it's probably not a valid peer any more.

Let's document this so we don't try to fix it again 😂

This probably didn't affect all of the peer candidates

There's some unusual behaviour in the logs, so I want to be careful we're not making too many assumptions.

If all the peers are considered invalid for gossip, the peer cache file doesn't get updated. So we can't assume too much based on that file either.

and MAX_RECENT_PEER_AGE from last_seen_is_recent() is 3 days.

@autotunafish how long was the network connection down for? More than 2 minutes, more than 3 hours, and more than 3 days are the important times where Zebra's behaviour changes.
Did you change the clock on your machine any time in the 3 days before this happened? Or was there a daylight saving time change? (This shouldn't impact Linux but we need to check.)

@arya2 I think the MAX_RECENT_PEER_AGE bug should be fixed, because it will impact nodes where the network is down for more than 3 days. But as a separate ticket with a different priority. (Depending on what the answer is here.)

There are valid candidates (I just checked in the TimerCrawl match arm), but it never gets to DemandHandshakeOrCrawl; in these logs it seems it's always calling crawl from the TimerCrawl match arm.

How do you know there are valid candidates?

There should be at least 3 running tasks that periodically create demand (peer crawler, block syncer, mempool queries). Maybe having no connected peers cancels the demand signal before it gets sent on the channel, or after it is picked up from the channel?

Do you know if the timer crawl is adding regular demand? Can you see it in the logs?

(Reading the code is helpful but not enough. There's a hang or task dependency happening here, so async code could be waiting for other async code at any await point.)

What do you think? Is there anything else that is required to fix this issue?

I think the specific problem should be a one-line fix, maybe a missing MorePeers signal.

Here's what I see in the logs:

  • Something seems to be stopping the syncer from making requests for the last 25 minutes; it's not even logging its timeout while waiting to make the next request
  • The peer crawler is timing out, which is expected, but it should still be adding demand
  • There is no evidence either way about the mempool crawler; it could be running but failing to make demands, or it could be hung or waiting on another task
  • I can't see any evidence that there are valid candidates in the last 25 minutes of the log, but before that window there were active peer connections

At the start of the last 25 minutes (where there isn't any peer activity logged), there is this log where all 60 peers have failed:

2023-10-18T19:50:09.783084Z  WARN {zebrad="51bb15c" net="Main"}:crawl_and_dial{new_peer_interval=61s}:crawl: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection address_metrics=AddressMetrics { responded: 0, never_attempted_gossiped: 0, never_attempted_alternate: 0, failed: 60, attempt_pending: 0, recently_live: 0, recently_stopped_responding: 0, num_addresses: 60, address_limit: 4000 }

To be available for connection, these peers need to:

  • not have an attempt, failure, or connection message within the last 2 minutes, and
  • have a connection message or gossiped last seen time within the last 3 hours (or be in a status other than Failed, but we know they were all failed due to the log)

It seems like these conditions should be satisfied, since we have logs of new blocks from peers 26 minutes ago. But it would be great to have more confirmation.
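
To make those two conditions concrete, here is a hedged sketch of the availability check under the stated 2-minute and 3-hour windows. The struct, field names, and constants are assumptions for illustration, not Zebra's actual address book types.

```rust
use std::time::{Duration, SystemTime};

/// Cooldown after an attempt, failure, or connection message (illustrative).
const MIN_PEER_RECONNECTION_DELAY: Duration = Duration::from_secs(2 * 60);
/// Window in which a connection or gossiped last seen time counts as live (illustrative).
const MAX_PEER_LIVE_AGE: Duration = Duration::from_secs(3 * 60 * 60);

/// Simplified stand-in for the timing fields of a failed peer entry.
struct FailedPeer {
    last_attempt_failure_or_message: Option<SystemTime>,
    last_connection_or_gossip: Option<SystemTime>,
}

/// Is this failed peer available for a new connection attempt,
/// under the two conditions listed above?
fn is_ready_for_attempt(peer: &FailedPeer, now: SystemTime) -> bool {
    let age = |t: Option<SystemTime>| t.and_then(|t| now.duration_since(t).ok());

    // Condition 1: no attempt, failure, or connection message in the last 2 minutes.
    let in_cooldown = age(peer.last_attempt_failure_or_message)
        .map_or(false, |a| a < MIN_PEER_RECONNECTION_DELAY);

    // Condition 2: a connection message or gossiped last seen time in the last 3 hours.
    let recently_live = age(peer.last_connection_or_gossip)
        .map_or(false, |a| a < MAX_PEER_LIVE_AGE);

    !in_cooldown && recently_live
}

fn main() {
    let now = SystemTime::now();
    // A peer that failed 30 minutes ago but sent a block 26 minutes ago
    // should be available again, as described in the comment above.
    let peer = FailedPeer {
        last_attempt_failure_or_message: Some(now - Duration::from_secs(30 * 60)),
        last_connection_or_gossip: Some(now - Duration::from_secs(26 * 60)),
    };
    assert!(is_ready_for_attempt(&peer, now));
}
```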

But I think we should split out any related issues.

@arya2 can you make a list of all the issues we've mentioned so far, so we can confirm them and prioritise them?

@arya2 (Contributor) commented Oct 25, 2023

How do you know there are valid candidates?

I checked that it reconnects to peers after losing network access for a few minutes when demand_tx.try_send(MorePeers) is added to the match arm for TimerCrawl in initialize::crawl_and_dial, and that without it, it never matches DemandHandshakeOrCrawl again.
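
As a hedged illustration of that experiment, here is roughly what adding a MorePeers demand signal to the TimerCrawl arm could look like. The enum, channel, and surrounding structure are simplified stand-ins for initialize::crawl_and_dial (assuming tokio with the macros feature), not its real code.

```rust
use tokio::sync::mpsc;

/// Simplified stand-ins for the crawler's actions and demand signal; the
/// real types live in zebra-network's initialize module.
enum CrawlerAction {
    TimerCrawl,
    DemandHandshakeOrCrawl,
}

struct MorePeers;

async fn handle_action(action: CrawlerAction, demand_tx: &mpsc::Sender<MorePeers>) {
    match action {
        CrawlerAction::TimerCrawl => {
            // ... crawl the address book / ask peers for more addresses ...

            // The change being tested above: when the crawl timer fires, also
            // signal demand so the dialer wakes up once candidates become valid
            // again. Ignore the error if the channel is already full.
            let _ = demand_tx.try_send(MorePeers);
        }
        CrawlerAction::DemandHandshakeOrCrawl => {
            // ... dial a candidate peer, or crawl if no candidates are ready ...
        }
    }
}

#[tokio::main]
async fn main() {
    let (demand_tx, mut demand_rx) = mpsc::channel(1);
    handle_action(CrawlerAction::TimerCrawl, &demand_tx).await;
    // The timer crawl now also produces a demand signal.
    assert!(matches!(demand_rx.recv().await, Some(MorePeers)));
}
```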

@mpguerra (Contributor)

Hey team! Please add your planning poker estimate with Zenhub @arya2 @oxarbitrage @teor2345 @upbqdn

@teor2345 (Contributor)

@mpguerra I don't think this ticket has a defined scope yet?

@arya2 can you make a list of all the issues we've mentioned so far, so we can confirm them and prioritise them?

I was waiting for this list of issues, then we'd work out which ones need to be fixed to fix this bug. After we have a list of fixes I can do an estimate.

@autotunafish how long was the network connection down for? More than 2 minutes, more than 3 hours, and more than 3 days are the important times where Zebra's behaviour changes.
Did you change the clock on your machine any time in the 3 days before this happened? Or was there a daylight saving time change? (This shouldn't impact Linux but we need to check.)

I was also waiting for an answer to this question to help prioritise the issues.

@teor2345 added the P-Medium ⚡, I-hang (A Zebra component stops responding to requests), I-usability (Zebra is hard to understand or use), and A-network (Area: Network protocol updates or fixes) labels on Oct 25, 2023
@teor2345 (Contributor)

autotunafish says that the network was down for up to 10 minutes, and there were no clock changes.

So this seems like a high or medium priority usability/reliability bug. And Arya's explanation seems to match the behaviour in the logs. Let's split anything about 3 hour or 3 day limits into other tickets?
