sync, tortoise: improve speed and efficiency of participation in protocol after restart #4504
Comments
i ran a test to explore this behavior by launching 20 nodes and interrupting them all for 30s every 30m (chaos mesh template). additionally, i set both the out-of-sync and gossip-sync thresholds to 0 (change). this is not important for tortoise behavior, but it helps to get into normal operation faster. layer 78 is empty; this is where the nodes were restarted. so the tortoise works as expected and didn't have to switch into full mode to make progress. however, there are other not-so-nice things that stall network progress and that we should improve:
thanks for sharing @dshulyak
how does a node differentiate a network restart from a local one? even though the state may be the same as its neighbors', it most likely missed some gossip messages. the node's sync state is initialized to another issue is the
once the syncer compares state with neighbors and is convinced that everyone has the same state, yes. just to double-check: when you say comparing state, you mean the accumulated layer hash (vs vm state)?
better if node just asks for missing data without waiting for fetcher to get it. also fetcher will not download all ballots as they are not referenced
i think hare can participate immediately, as long as it has beacon, and active set in consensus. so gossipSynced doesn't make much sense for it. i am actually not sure when we need the gossipSynced state. i think it is also not useful for atxs and beacon
yes, it is the same. and yes, i meant the consensus (accumulated, mesh) hash
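
to make the "compare state with neighbors" idea concrete, here is a minimal sketch in Go. it is not the go-spacemesh syncer: the `peerState` type and the `agreesWithPeers` function are hypothetical names made up for illustration, and the only assumption taken from the thread is that "state" means the accumulated (mesh) hash for a layer.

```go
package main

import (
	"bytes"
	"fmt"
)

// peerState is a hypothetical summary of what a node reports for its
// latest layer. The field names are assumptions, not go-spacemesh types.
type peerState struct {
	Layer     uint32
	AccumHash []byte // accumulated (mesh) hash up to Layer
}

// agreesWithPeers reports whether every peer reports the same layer and
// the same accumulated hash as the local node. With no peers there is no
// evidence of agreement, so it returns false.
func agreesWithPeers(local peerState, peers []peerState) bool {
	if len(peers) == 0 {
		return false
	}
	for _, p := range peers {
		if p.Layer != local.Layer || !bytes.Equal(p.AccumHash, local.AccumHash) {
			return false
		}
	}
	return true
}

func main() {
	local := peerState{Layer: 78, AccumHash: []byte{0xaa}}
	peers := []peerState{
		{Layer: 78, AccumHash: []byte{0xaa}},
		{Layer: 78, AccumHash: []byte{0xaa}},
	}
	fmt.Println(agreesWithPeers(local, peers))
}
```

a check like this says nothing about gossip messages missed during the downtime; it only covers the "everyone has the same state" condition discussed above.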
@countvonzero do you think we need gossip synced state? i think we should do something to enable fast restarts. given that we drop messages when node is not synced, any urgent upgrade will halt network for several layers, and in some configurations it will be extremely painful (like running several public nodes in front of smeshers)
as is, i don't see value in the only thing in the code that needs to change if we remove
alternatively, we can change
from my perspective the most important part is to allow node to gossip right after short restart, participating in hare/tortoise is secondary. maybe we can adjust those filters in p2p as a low risk change
hmmm.... there is some misunderstanding @dshulyak
the pattern occurs when all nodes are restarted in the network, example:
the downside is that the network had to switch into full mode to verify ballots. with a long hdist this period will take quite some time, which is definitely not optimal
it can be tested in system tests by failing all nodes (don't delete them, as it will also drop the state)