Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

[client/network] Remove unused argument #13444

Merged

Conversation

melekes
Copy link
Contributor

@melekes melekes commented Feb 23, 2023

Refs https://github.com/paritytech/infrastructure-alerts/issues/26

  • remove unused ban argument
  • improve 💔 Called on_block_justification with a bad peer ID error message

```
sync: Too many full nodes, rejecting 12D3KooWSQAP2fh4qBkLXBW4mvCtbAiK8sqMnExWHHTZtVAxZ8bQ
sync: 12D3KooWSQAP2fh4qBkLXBW4mvCtbAiK8sqMnExWHHTZtVAxZ8bQ disconnected
```

is enough to understand that we've refused to connect to the given peer
@melekes melekes changed the title Anton/13295 sync issues [client/network] Ban peer after disconnecting Feb 23, 2023
@altonen altonen requested a review from a team February 23, 2023 07:54
@altonen
Copy link
Contributor

altonen commented Feb 23, 2023

I wonder why the banning mechanism was disabled. Could you add a test case showing that the peer is actually banned when we disconnect from it?

@melekes
Copy link
Contributor Author

melekes commented Feb 23, 2023

I wonder why the banning mechanism was disabled.

I think only @tomaka knows.

Could you add a test case showing that the peer is actually banned when we disconnect from it?

sure

@tomaka
Copy link
Contributor

tomaka commented Feb 23, 2023

A ban is not actually about banning the peer completely, it is simply an indication to not establish an outgoing substream again to the same peer.

The peer is still allowed to establish outgoing connections to us. This was completely intended and has always been.

This whole thing has had emergent behaviors in the past, where peers keep banning each other or spam each other and use tons of CPU. Even though the nodes were working as intended, the combination of the behavior of both sides led to issues that we didn't expect. I can't stress enough that modifications to this banning behavior will probably have issues that are very hard to predict.
I would in general strongly recommend to not modify the logic in behavior.rs unless you really really know what you're doing and/or are ready to extensively test things and look at graphs for hours.

@melekes
Copy link
Contributor Author

melekes commented Feb 24, 2023

This whole thing has had emergent behaviors in the past, where peers keep banning each other or spam each other and use tons of CPU.

well, that's happening right now because the remote peer does not get banned and immediately tries to open a connection to us. this leads to an endless loop if the node has too many peers.

I would in general strongly recommend to not modify the logic in behavior.rs unless you really really know what you're doing and/or are ready to extensively test things and look at graphs for hours.

reasonable concern 👍

@tomaka
Copy link
Contributor

tomaka commented Feb 24, 2023

well, that's happening right now because the remote peer does not get banned and immediately tries to open a connection to us. this leads to an endless loop if the node has too many peers.

The intended solution to that is that it's the remote that politely makes sure to not try connecting again if we are full. In other words, if A opens a substream to B and B is full then A should ban B and shouldn't try again. This is normally already the case in the code.

A node needs outgoing substreams in order to be guaranteed to function properly, which is why making sure that it is able to find outgoing substreams to well-behaving nodes is important. You can't function with inbound substreams alone, because that would make you vulnerable to eclipse attacks.

Accepting inbound substreams on the other hand is nothing more than a service that it graciously provides. A node doesn't need to protect itself from inbound substreams, because it's not a problem if a misbehaving/malicious node connects.

You could say that banning bad peers that establish substreams to us help freeing slots for the good peers. However, this is no way a protection against an attacker that wants to squat a node. We went with the opinion that since there shouldn't be buggy peers and that you can't guard against malicious peers, then there's also no need to create defenses against them. Instead, energy is better spent avoiding/fixing buggy peers in the first place.

@melekes
Copy link
Contributor Author

melekes commented Feb 24, 2023

this leads me to a question: why is our node tries to establish outgoing connections if it knows it's full (all 40 slots are occupied)?

@tomaka
Copy link
Contributor

tomaka commented Feb 24, 2023

this leads me to a question: why is our node tries to establish outgoing connections if it knows it's full (all 40 slots are occupied)?

It shouldn't, but I feel like there might be some confusion.

All nodes are allowed to "connect" to any other node at any time, as in TCP/WebRTC/QUIC/etc. connections.
This is necessary in order for the DHT, the collations, etc. to work.

There's a limit to the number of incoming "connections" but it's very large (several thousands), and there's no limit to the number of outgoing connections.

The slots (the 40 slots or whatever) and banning system only apply to block announces substreams. There are N out slots, meaning that we'll have only up to 40 outgoing block announces substreams. There are also N in slots, meaning that we'll only accept up to 40 incoming block announces.
The transactions and grandpa substreams are also obligated to follow the block announces substream. In other words we only open/maintain transactions & grandpa substreams with the same peers that already have a block announces substream. So these two are also concerned, but indirectly.
The slots and banning system do not apply to other substreams (kademlia, identify, ping, collation, block requests, etc.)

Note that in my answers above I make sure to talk about substreams, because the term "connection" is vague and confusing.

@melekes melekes changed the title [client/network] Ban peer after disconnecting [client/network] Remove unused argument Feb 27, 2023
@melekes melekes marked this pull request as ready for review February 27, 2023 08:44
@melekes melekes self-assigned this Feb 27, 2023
@melekes melekes added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. labels Feb 27, 2023
@melekes
Copy link
Contributor Author

melekes commented Feb 27, 2023

There's a limit to the number of incoming "connections" but it's very large (several thousands), and there's no limit to the number of outgoing connections.
The slots (the 40 slots or whatever) and banning system only apply to block announces substreams.

Thanks for explaining this @tomaka 🙏

@melekes melekes requested a review from a team February 27, 2023 08:50
@melekes melekes removed the request for review from a team February 27, 2023 08:50
client/network/src/protocol.rs Outdated Show resolved Hide resolved
@dmitry-markin dmitry-markin requested a review from a team February 27, 2023 09:04
Copy link
Contributor

@altonen altonen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but please restore the removed print.

I don't have access to the logs of the issue but if there are hundreds of messages per second from one peer, we need to look into why does that peer not backoff after getting rejected but instead keeps trying without delay. If that is the case, it doesn't sound like correct behavior to me.

@melekes
Copy link
Contributor Author

melekes commented Feb 27, 2023

bot merge

@paritytech-processbot paritytech-processbot bot merged commit 4ad96cf into paritytech:master Feb 27, 2023
@melekes melekes deleted the anton/13295-sync-issues branch February 27, 2023 10:43
ukint-vs pushed a commit to gear-tech/substrate that referenced this pull request Apr 10, 2023
* improve error message

* removed unused argument

* docs: disconnect_peer_inner no longer accepts `ban`

* remove redundant trace message

```
sync: Too many full nodes, rejecting 12D3KooWSQAP2fh4qBkLXBW4mvCtbAiK8sqMnExWHHTZtVAxZ8bQ
sync: 12D3KooWSQAP2fh4qBkLXBW4mvCtbAiK8sqMnExWHHTZtVAxZ8bQ disconnected
```

is enough to understand that we've refused to connect to the given peer

* Revert "removed unused argument"

This reverts commit c87f755.

* ban peer for 10s after disconnect

* do not accept incoming conns if peer was banned

* Revert "do not accept incoming conns if peer was banned"

This reverts commit 7e59d05.

* Revert "ban peer for 10s after disconnect"

This reverts commit 3859201.

* Revert "Revert "removed unused argument""

This reverts commit f1dc623.

* format code

* Revert "remove redundant trace message"

This reverts commit a87e65f.
nathanwhit pushed a commit to nathanwhit/substrate that referenced this pull request Jul 19, 2023
* improve error message

* removed unused argument

* docs: disconnect_peer_inner no longer accepts `ban`

* remove redundant trace message

```
sync: Too many full nodes, rejecting 12D3KooWSQAP2fh4qBkLXBW4mvCtbAiK8sqMnExWHHTZtVAxZ8bQ
sync: 12D3KooWSQAP2fh4qBkLXBW4mvCtbAiK8sqMnExWHHTZtVAxZ8bQ disconnected
```

is enough to understand that we've refused to connect to the given peer

* Revert "removed unused argument"

This reverts commit c87f755.

* ban peer for 10s after disconnect

* do not accept incoming conns if peer was banned

* Revert "do not accept incoming conns if peer was banned"

This reverts commit 7e59d05.

* Revert "ban peer for 10s after disconnect"

This reverts commit 3859201.

* Revert "Revert "removed unused argument""

This reverts commit f1dc623.

* format code

* Revert "remove redundant trace message"

This reverts commit a87e65f.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants