
bug: zebrad will not reconnect after an internet connection failure and restore #7772

Closed
1 of 4 tasks
autotunafish opened this issue Oct 18, 2023 · 10 comments · Fixed by #7853 or #7869
Assignees: arya2
Labels: A-network (Area: Network protocol updates or fixes), C-bug (Category: This is a bug), I-hang (A Zebra component stops responding to requests), I-usability (Zebra is hard to understand or use)

Comments

@autotunafish

What happened?

I expected to see this happen: zebrad re-establishes a working connection after the internet connection is re-established.

Instead, this happened: zebrad continuously warns that 'chain updates have stalled, state height has not increased for # minutes'. Restarting the node fixes the issue.

What were you doing when the issue happened?

Disconnecting the router attached to the device from the internet and then reconnecting it (I'm filing this issue from the device in question while zebrad is failing to reconnect).

Zebra logs

https://gist.github.com/autotunafish/30959aa159bf301c38af8f4480a206e2

Peers
https://gist.github.com/autotunafish/79fdcdfcb20ce1b4eb6a27f24885babd

Zebra Version

1.3.0

Which operating systems does the issue happen on?

  • Linux
  • macOS
  • Windows
  • Other OS

OS details

Linux john-Satellite-L775D 5.4.0-164-generic #181-Ubuntu SMP Fri Sep 1 13:41:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Additional information

No response

@autotunafish added the C-bug (Category: This is a bug) and S-needs-triage (Status: A bug report needs triage) labels on Oct 18, 2023
@mpguerra added this to the Zebra project on Oct 18, 2023
The github-project-automation bot moved this to 🆕 New in Zebra on Oct 18, 2023
@mpguerra (Contributor)

Thank you for reporting!

@ZcashFoundation/zebra-team Let's look into this next sprint

@teor2345 (Contributor) commented Oct 19, 2023

I think the underlying issue here is our design decision for dealing with two competing priorities: rate-limiting connection attempts to failed peers, and recovering quickly once connectivity comes back.

We currently use a fixed time limit after which we delete/ignore failed peers. But an exponential backoff would achieve both goals; it's the typical solution to these kinds of rate-limiting / recovery issues.

@teor2345 (Contributor)

Also, it should be a randomised exponential backoff. If it's not randomised, and a significant part of the network goes down at the same time, there could be a load spike when it comes back up.
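
For illustration, here is a minimal sketch of what a randomised exponential backoff for reconnection attempts could look like. The constants, the function name, and the use of the rand 0.8 API are assumptions for the sketch, not Zebra's actual implementation.

```rust
use std::time::Duration;

use rand::Rng; // assumes rand 0.8's range-based gen_range API

/// Shortest delay before retrying a failed peer (illustrative value).
const BASE_RETRY_DELAY: Duration = Duration::from_secs(60);
/// Longest delay before retrying a failed peer (illustrative value).
const MAX_RETRY_DELAY: Duration = Duration::from_secs(3 * 60 * 60);

/// Delay before the next connection attempt to a failed peer: it doubles
/// with each consecutive failure, is capped at MAX_RETRY_DELAY, and is
/// randomised so peers across the network don't retry in lockstep.
fn retry_delay(consecutive_failures: u32) -> Duration {
    let capped = BASE_RETRY_DELAY
        .checked_mul(2u32.saturating_pow(consecutive_failures))
        .unwrap_or(MAX_RETRY_DELAY)
        .min(MAX_RETRY_DELAY);

    // Pick a random delay between 50% and 100% of the capped value.
    let cap_ms = capped.as_millis() as u64;
    let jitter_ms = rand::thread_rng().gen_range(cap_ms / 2..=cap_ms);
    Duration::from_millis(jitter_ms)
}

fn main() {
    for failures in 0..5 {
        println!("failures={failures} -> retry in {:?}", retry_delay(failures));
    }
}
```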

@arya2 self-assigned this on Oct 23, 2023
@teor2345 (Contributor)

@arya2 let's try to work out the scope of this bug fix before you start work? And a rough design?

In some previous zebra-network changes, the scope has grown dramatically from the initial ticket, and in the retrospective we just said we wanted to avoid that.

Looking at the code, I think there are two related bugs here. The first is that the peer reconnection limit is a fixed constant; changing it to a randomised exponential backoff needs around 4 small code changes.

The second bug is that peers without last seen times never get attempted. This is a single line fix.

There will be other trivial code changes, documentation changes, and tests. But I am expecting this to be a small PR.

What do you think? Is there anything else that is required to fix this issue?

@arya2 (Contributor) commented Oct 24, 2023

The second bug is that peers without last seen times never get attempted. This is a single line fix.

This probably didn't affect all of the peer candidates, and MAX_RECENT_PEER_AGE from last_seen_is_recent() is 3 days.

There are valid candidates (I just checked in the TimerCrawl match arm), but it never gets to DemandHandshakeOrCrawl; in these logs it seems it's always calling crawl from the TimerCrawl match arm.

What do you think? Is there anything else that is required to fix this issue?

I think the specific problem should be a one-line fix, maybe a missing MorePeers signal.

But I think we should split out any related issues.
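
For context, here is a rough sketch of what a last_seen_is_recent-style check with a 3-day MAX_RECENT_PEER_AGE could look like. The struct, field names, and edge-case handling are illustrative assumptions, not Zebra's actual address book code.

```rust
use std::time::{Duration, SystemTime};

/// Oldest "last seen" time that still counts as recent (3 days, matching the
/// constant mentioned above; the surrounding code is illustrative).
const MAX_RECENT_PEER_AGE: Duration = Duration::from_secs(3 * 24 * 60 * 60);

/// A simplified stand-in for a peer address book entry.
struct PeerEntry {
    last_seen: Option<SystemTime>,
}

impl PeerEntry {
    /// Returns true if the peer was seen within MAX_RECENT_PEER_AGE.
    /// Peers with no last seen time are never considered recent, matching the
    /// "failed peers without a last seen time are skipped" behaviour discussed below.
    fn last_seen_is_recent(&self, now: SystemTime) -> bool {
        match self.last_seen {
            // A last seen time in the future is treated as recent here.
            Some(seen) => now
                .duration_since(seen)
                .map(|age| age <= MAX_RECENT_PEER_AGE)
                .unwrap_or(true),
            None => false,
        }
    }
}

fn main() {
    let now = SystemTime::now();
    let fresh = PeerEntry { last_seen: Some(now - Duration::from_secs(60 * 60)) };
    let stale = PeerEntry { last_seen: Some(now - Duration::from_secs(4 * 24 * 60 * 60)) };
    let unknown = PeerEntry { last_seen: None };

    assert!(fresh.last_seen_is_recent(now));
    assert!(!stale.last_seen_is_recent(now));
    assert!(!unknown.last_seen_is_recent(now));
}
```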

@teor2345 (Contributor) commented Oct 24, 2023

The second bug is that peers without last seen times never get attempted. This is a single line fix.

Actually I think this is the correct behaviour, because it's "failed peers without a last seen time". If a peer address has never had a connection from us, and we got its address from a seeder, Version message, inbound connection IP, or the peers file, it's probably not a valid peer any more.

Let's document this so we don't try to fix it again 😂

This probably didn't affect all of the peer candidates

There's some unusual behaviour in the logs, so I want to be careful we're not making too many assumptions.

If all the peers are considered invalid for gossip, the peer cache file doesn't get updated. So we can't assume too much based on that file either.

and MAX_RECENT_PEER_AGE from last_seen_is_recent() is 3 days.

@autotunafish how long was the network connection down for? More than 2 minutes, more than 3 hours, and more than 3 days are the important times where Zebra's behaviour changes.
Did you change the clock on your machine any time in the 3 days before this happened? Or was there a daylight saving time change? (This shouldn't impact Linux but we need to check.)

@arya2 I think the MAX_RECENT_PEER_AGE bug should be fixed, because it will impact nodes where the network is down for more than 3 days. But as a separate ticket with a different priority. (Depending on what the answer is here.)

There are valid candidates (I just checked in the TimerCrawl match arm), but it never gets to DemandHandshakeOrCrawl; in these logs it seems it's always calling crawl from the TimerCrawl match arm.

How do you know there are valid candidates?

There should be at least 3 running tasks that periodically create demand (peer crawler, block syncer, mempool queries). Maybe having no connected peers cancels the demand signal before it gets sent on the channel, or after it is picked up from the channel?

Do you know if the timer crawl is adding regular demand? Can you see it in the logs?

(Reading the code is helpful but not enough. There's a hang or task dependency happening here, so async code could be waiting for other async code at any await point.)

What do you think? Is there anything else that is required to fix this issue?

I think the specific problem should be a one-line fix, maybe a missing MorePeers signal.

Here's what I see in the logs:

  • Something seems to be stopping the syncer from making requests for the last 25 minutes; it's not even logging its timeout while waiting to make the next request
  • The peer crawler is timing out, which is expected, but it should still be adding demand
  • There is no evidence either way about the mempool crawler; it could be running but failing to make demands, or it could be hung or waiting on another task
  • I can't see any evidence that there are valid candidates in the last 25 minutes of the log, but before that window there were active peer connections

At the start of the last 25 minutes (where there isn't any peer activity logged), there is this log where all 60 peers have failed:

2023-10-18T19:50:09.783084Z  WARN {zebrad="51bb15c" net="Main"}:crawl_and_dial{new_peer_interval=61s}:crawl: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection address_metrics=AddressMetrics { responded: 0, never_attempted_gossiped: 0, never_attempted_alternate: 0, failed: 60, attempt_pending: 0, recently_live: 0, recently_stopped_responding: 0, num_addresses: 60, address_limit: 4000 }

To be available for connection, these peers need to:

  • not have an attempt, failure, or connection message within the last 2 minutes, and
  • have a connection message or gossiped last seen time within the last 3 hours (or be in a status other than Failed, but we know they were all failed due to the log)

It seems like these conditions should be satisfied, since we have logs of new blocks from peers 26 minutes ago. But it would be great to have more confirmation.
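
To make those two conditions concrete, here is a hedged sketch of the availability check under the stated 2-minute and 3-hour windows. The struct, field names, and constants are assumptions for illustration, not Zebra's actual address book types.

```rust
use std::time::{Duration, SystemTime};

/// Cooldown after an attempt, failure, or connection message (illustrative).
const MIN_PEER_RECONNECTION_DELAY: Duration = Duration::from_secs(2 * 60);
/// Window in which a connection or gossiped last seen time counts as live (illustrative).
const MAX_PEER_LIVE_AGE: Duration = Duration::from_secs(3 * 60 * 60);

/// Simplified stand-in for the timing fields of a failed peer entry.
struct FailedPeer {
    last_attempt_failure_or_message: Option<SystemTime>,
    last_connection_or_gossip: Option<SystemTime>,
}

/// Is this failed peer available for a new connection attempt,
/// under the two conditions listed above?
fn is_ready_for_attempt(peer: &FailedPeer, now: SystemTime) -> bool {
    let age = |t: Option<SystemTime>| t.and_then(|t| now.duration_since(t).ok());

    // Condition 1: no attempt, failure, or connection message in the last 2 minutes.
    let in_cooldown = age(peer.last_attempt_failure_or_message)
        .map_or(false, |a| a < MIN_PEER_RECONNECTION_DELAY);

    // Condition 2: a connection message or gossiped last seen time in the last 3 hours.
    let recently_live = age(peer.last_connection_or_gossip)
        .map_or(false, |a| a < MAX_PEER_LIVE_AGE);

    !in_cooldown && recently_live
}

fn main() {
    let now = SystemTime::now();
    // A peer that failed 30 minutes ago but sent a block 26 minutes ago
    // should be available again, as described in the comment above.
    let peer = FailedPeer {
        last_attempt_failure_or_message: Some(now - Duration::from_secs(30 * 60)),
        last_connection_or_gossip: Some(now - Duration::from_secs(26 * 60)),
    };
    assert!(is_ready_for_attempt(&peer, now));
}
```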

But I think we should split out any related issues.

@arya2 can you make a list of all the issues we've mentioned so far, so we can confirm them and prioritise them?

@arya2 (Contributor) commented Oct 25, 2023

How do you know there are valid candidates?

I checked that it reconnects to peers after losing network access for a few minutes when demand_tx.try_send(MorePeers) is added to the match arm for TimerCrawl in initialize::crawl_and_dial, and that without it, it never matches DemandHandshakeOrCrawl again.
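
As a hedged illustration of that experiment, here is roughly what adding a MorePeers demand signal to the TimerCrawl arm could look like. The enum, channel, and surrounding structure are simplified stand-ins for initialize::crawl_and_dial (assuming tokio with the macros feature), not its real code.

```rust
use tokio::sync::mpsc;

/// Simplified stand-ins for the crawler's actions and demand signal; the
/// real types live in zebra-network's initialize module.
enum CrawlerAction {
    TimerCrawl,
    DemandHandshakeOrCrawl,
}

struct MorePeers;

async fn handle_action(action: CrawlerAction, demand_tx: &mpsc::Sender<MorePeers>) {
    match action {
        CrawlerAction::TimerCrawl => {
            // ... crawl the address book / ask peers for more addresses ...

            // The change being tested above: when the crawl timer fires, also
            // signal demand so the dialer wakes up once candidates become valid
            // again. Ignore the error if the channel is already full.
            let _ = demand_tx.try_send(MorePeers);
        }
        CrawlerAction::DemandHandshakeOrCrawl => {
            // ... dial a candidate peer, or crawl if no candidates are ready ...
        }
    }
}

#[tokio::main]
async fn main() {
    let (demand_tx, mut demand_rx) = mpsc::channel(1);
    handle_action(CrawlerAction::TimerCrawl, &demand_tx).await;
    // The timer crawl now also produces a demand signal.
    assert!(matches!(demand_rx.recv().await, Some(MorePeers)));
}
```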

@mpguerra (Contributor)

Hey team! Please add your planning poker estimate with Zenhub @arya2 @oxarbitrage @teor2345 @upbqdn

@teor2345 (Contributor)

@mpguerra I don't think this ticket has a defined scope yet?

@arya2 can you make a list of all the issues we've mentioned so far, so we can confirm them and prioritise them?

I was waiting for this list of issues, then we'd work out which ones need to be fixed to fix this bug. After we have a list of fixes I can do an estimate.

@autotunafish how long was the network connection down for? More than 2 minutes, more than 3 hours, and more than 3 days are the important times where Zebra's behaviour changes.
Did you change the clock on your machine any time in the 3 days before this happened? Or was there a daylight saving time change? (This shouldn't impact Linux but we need to check.)

I was also waiting for an answer to this question to help prioritise the issues.

@teor2345 added the P-Medium ⚡, I-hang (A Zebra component stops responding to requests), I-usability (Zebra is hard to understand or use), and A-network (Area: Network protocol updates or fixes) labels on Oct 25, 2023
@teor2345 (Contributor)

autotunafish says that the network was down for up to 10 minutes, and there were no clock changes.

So this seems like a high or medium priority usability/reliability bug. And Arya's explanation seems to match the behaviour in the logs. Let's split anything about 3 hour or 3 day limits into other tickets?
