forked from tendermint/tendermint
Gossip data to a peer without valid channel increases cpu usage #4
Addressed by cometbft/cometbft#241, cometbft/cometbft#244, and cometbft/cometbft#245.
This was referenced Feb 6, 2023
JimLarson added a commit to agoric-labs/tendermint that referenced this issue on Feb 8, 2023:
See issue informalsystems#4. Compare to their patch informalsystems/tendermint#245.
Tendermint version (use `tendermint version` or `git rev-parse --verify HEAD` if installed from source): 0.34.23
ABCI app (name for built-in, URL for self-written if it's publicly available):
https://github.com/public-awesome/stargaze
Environment:
What happened:
There are currently multiple reports of increased CPU usage on the Stargaze mainnet network, without any meaningful change in our current stack.
After digging a bit, we found that `gossipDataRoutine`, and specifically the `gossipDataForCatchup` method, was causing this increase.
In the following snippet (tendermint/consensus/reactor.go, lines 698 to 710 in e0f68fe), if `SendEnvelopeShim` fails, the routine immediately retries gossiping the same block part until the peer state changes (a different round, etc.). Each retry generates more work because it loads the block meta and the block part from disk again.
Adding a small sleep, as in the other error checks, fixes the problem. Our fork does this in public-awesome@da5a32f, which seemed to reduce the CPU usage:
time.Sleep(conR.conS.config.PeerGossipSleepDuration)
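To illustrate why the sleep helps, here is a minimal, self-contained sketch of the retry loop. It is not the Tendermint code: `stubPeer`, `stubStore`, `gossipLoads` and `compareLoads` are hypothetical stand-ins, and the 50 ms window and 10 ms sleep are arbitrary values chosen for the demo. It counts how often the block part gets loaded while a peer keeps rejecting the send.

```go
package main

import (
	"fmt"
	"time"
)

// stubPeer is a hypothetical stand-in for a p2p peer. send reports whether
// the message was accepted, mirroring the boolean result of a real send.
type stubPeer struct{ accepts bool }

func (p stubPeer) send(part []byte) bool { return p.accepts }

// stubStore is a hypothetical stand-in for the block store. It counts loads
// so the work done by each retry is visible.
type stubStore struct{ loads int }

func (s *stubStore) loadBlockPart() []byte {
	s.loads++
	return []byte("block part")
}

// gossipLoads retries sending one block part to p until window elapses and
// returns how many times the part was loaded. After a failed send it sleeps
// for sleepOnFail; a zero duration reproduces the immediate retry.
func gossipLoads(p stubPeer, window, sleepOnFail time.Duration) int {
	s := &stubStore{}
	deadline := time.Now().Add(window)
	for time.Now().Before(deadline) {
		part := s.loadBlockPart()
		if p.send(part) {
			break // the real routine would record that the peer has the part
		}
		if sleepOnFail > 0 {
			time.Sleep(sleepOnFail)
		}
	}
	return s.loads
}

// compareLoads runs the loop against a peer that rejects every send, first
// without and then with a sleep on failure.
func compareLoads() (busy, slept int) {
	bad := stubPeer{accepts: false}
	busy = gossipLoads(bad, 50*time.Millisecond, 0)
	slept = gossipLoads(bad, 50*time.Millisecond, 10*time.Millisecond)
	return busy, slept
}

func main() {
	busy, slept := compareLoads()
	fmt.Printf("loads without sleep: %d, with sleep: %d\n", busy, slept)
}
```

Without the sleep the loop reloads the part as fast as the CPU allows, while with the sleep it only reloads a handful of times in the same window. The exact counts depend on the machine.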
Currently there is no way to know from this method whether the peer is valid for sending the packet, since `hasChannel` is a private method. Ideally, we could avoid loading from disk by first checking `peer.IsValid()` and only then executing the remaining logic.
What you expected to happen:
A delay, or a check that prevents sending data to a peer in an invalid state.
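The check-first idea could look roughly like the sketch below. Everything here is an assumption made for illustration: `fakePeer`, `fakeStore`, `HasChannel`, `sendPartIfReachable` and the `dataChannel` value are hypothetical names, not the existing Tendermint API (where, as noted above, `hasChannel` is private). The point is only that the disk load is skipped when the peer cannot receive on the channel.

```go
package main

import "fmt"

// dataChannel is an illustrative channel ID used only in this sketch.
const dataChannel byte = 0x21

// fakePeer is a hypothetical peer that knows which channels it supports.
type fakePeer struct{ channels map[byte]bool }

// HasChannel is an assumed public check; the issue notes that the real
// equivalent is private.
func (p fakePeer) HasChannel(id byte) bool { return p.channels[id] }

// Send succeeds only if the peer supports the channel.
func (p fakePeer) Send(id byte, msg []byte) bool { return p.channels[id] }

// fakeStore is a hypothetical block store that counts disk loads.
type fakeStore struct{ loads int }

func (s *fakeStore) LoadBlockPart(height int64, index int) []byte {
	s.loads++
	return []byte{byte(index)}
}

// sendPartIfReachable checks the peer before touching the store, so a peer
// without the data channel costs no disk read at all.
func sendPartIfReachable(p fakePeer, s *fakeStore, height int64, index int) bool {
	if !p.HasChannel(dataChannel) {
		return false // caller would sleep or move on to another peer
	}
	part := s.LoadBlockPart(height, index)
	return p.Send(dataChannel, part)
}

func main() {
	s := &fakeStore{}
	bad := fakePeer{channels: map[byte]bool{}}
	good := fakePeer{channels: map[byte]bool{dataChannel: true}}

	fmt.Println(sendPartIfReachable(bad, s, 10, 0), s.loads)
	fmt.Println(sendPartIfReachable(good, s, 10, 0), s.loads)
}
```

A check like this would complement the sleep rather than replace it: a send can still fail for other reasons, and the caller still needs to back off in that case.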
Have you tried the latest version: Yes
How to reproduce it (as minimally and precisely as possible):
The current network conditions are hard to replicate, since it seems that some invalid peers in the network are causing this issue, but joining the network with a new node will reproduce it.
Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file):
Config (you can paste only the changes you've made):
node command runtime flags:
Please provide the output from the `http://<ip>:<port>/dump_consensus_state` RPC endpoint for consensus bugs:
Anything else we need to know: