sync, tortoise: improve speed and efficiency of participation in protocol after restart #4504

Open
dshulyak opened this issue Jun 11, 2023 · 8 comments

dshulyak commented Jun 11, 2023

this pattern occurs when all nodes in the network are restarted. example:

  • layer 100 is verified, and everyone votes consistently on it
  • no ballots or blocks in layers 101-102
  • layers from 103 onward will have a block, but ballots in layers 103-105 will vote abstain on the previous layers, making them uncountable in verifying mode
  • after layer 105 all opinions should be consistent across the network, and after some number of layers we should be able to cross the pessimistic threshold

the downside is that the network has to switch into full mode to verify such ballots; with a long hdist this period will take quite some time and is definitely not optimal (the sketch below illustrates why abstaining ballots are uncountable in verifying mode)

it can be tested in system tests by failing all nodes (don't delete them, as that would also drop their state)
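
To illustrate, here is a minimal, self-contained Go sketch (an illustration, not the go-spacemesh implementation) of why abstaining ballots stall verification: an abstaining ballot contributes no weight to the layers it abstains on, so the counted weight cannot cross the pessimistic threshold until enough concrete votes accumulate.

```go
package main

import "fmt"

// ballot is a simplified stand-in for a tortoise ballot: it either casts
// a concrete vote on a layer or abstains on it. Names and numbers here
// are illustrative, not taken from go-spacemesh.
type ballot struct {
	weight  float64
	abstain bool
}

// countedWeight sums the weight of ballots with concrete votes; in
// verifying mode an abstaining ballot contributes nothing.
func countedWeight(ballots []ballot) float64 {
	var w float64
	for _, b := range ballots {
		if !b.abstain {
			w += b.weight
		}
	}
	return w
}

func main() {
	const threshold = 0.6   // hypothetical pessimistic threshold fraction
	const totalWeight = 30.0
	// Ballots from layers 103-105 in the example all abstain on 101-102,
	// so those layers accumulate no weight and cannot be verified yet.
	ballots := []ballot{{10, true}, {10, true}, {10, true}}
	fmt.Println(countedWeight(ballots) >= threshold*totalWeight) // false
}
```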

github-project-automation bot moved this to 📋 Backlog in Dev team kanban Jun 11, 2023
dshulyak moved this from 📋 Backlog to 🔖 Next in Dev team kanban Jun 11, 2023
dshulyak self-assigned this Jun 14, 2023
dshulyak moved this from 🔖 Next to 🏗 Doing in Dev team kanban Jun 14, 2023
dshulyak commented:

i ran a test to explore this behavior by launching 20 nodes and interrupting them all for 30s every 30m (chaos mesh template)

additionally, i set both the out-of-sync and gossip-sync thresholds to 0 (change). this is not important for tortoise behavior, but it helps the node get back into normal operation faster.

logs: consensus ballots

layer 78 is empty; this is where the nodes were restarted.
layer 79 is immediately successful, but its ballots abstain on 78, and due to limitations of the verifying tortoise algorithm we can't count such ballots.
a couple of layers later, the tortoise accumulated enough weight to cross the pessimistic threshold and made progress in verifying mode.

so the tortoise works as expected and didn't have to switch into full mode to make progress. however, there are other problems that stall network progress and that we should improve:

  1. sync and gossip-sync thresholds. we should find a better heuristic that allows nodes to participate in hare and tortoise as soon as possible after a restart. one such heuristic might be to ask peers for their opinions and, if they match our own, switch into synced immediately; in practice this should allow recovery within seconds (see the sketch after this list)
  2. if we switch into synced mode, we should also notify the tortoise that the missed layers are likely empty so that it doesn't vote abstain on them. i think that is ok
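
Below is a minimal sketch of the heuristic from point 1. The `peerOpinion` query and all names are assumptions for illustration, not existing go-spacemesh APIs: compare our consensus hash for the last processed layer against a sample of peers and, on agreement, switch straight to synced.

```go
package main

import "fmt"

// peerOpinion is a hypothetical query for a neighbor's consensus
// (accumulated mesh) hash at a given layer.
type peerOpinion func(layer uint32) (hash string, err error)

// matchesPeers returns true when every sampled peer reports the same hash
// we computed locally, i.e. the restart did not make us diverge and we can
// mark ourselves synced immediately instead of waiting out the thresholds.
func matchesPeers(layer uint32, localHash string, peers []peerOpinion) bool {
	if len(peers) == 0 {
		return false
	}
	for _, ask := range peers {
		h, err := ask(layer)
		if err != nil || h != localHash {
			return false
		}
	}
	return true
}

func main() {
	agree := func(uint32) (string, error) { return "0xabc", nil }
	fmt.Println(matchesPeers(78, "0xabc", []peerOpinion{agree, agree})) // true
}
```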

cc @countvonzero

dshulyak removed their assignment Jun 16, 2023
dshulyak moved this from 🏗 Doing to 🔖 Next in Dev team kanban Jun 16, 2023
dshulyak changed the title from "tortoise: don't switch into full mode after several layers of missing ballots and abstain votes" to "sync, tortoise: improve speed and efficiency of participation in protocol after restart" Jun 16, 2023
countvonzero commented:

thanks for sharing @dshulyak

  1. sync and gossip-sync thresholds ... in practice this should allow recovery within seconds

how does a node differentiate a network restart from a local one? even though the state may be the same as its neighbors', it most likely missed some gossip messages.

the node's sync state is initialized to notSynced. maybe it could be initialized to synced and only changed to notSynced if it lags by more than the sync threshold (3)? the node should be able to recursively fetch the data missed due to the restart for those 3 layers (a minimal sketch follows).
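
A minimal sketch of this suggestion, with assumed names (the real syncer state machine differs): start optimistic after a restart and demote to notSynced only when the gap exceeds the threshold.

```go
package main

import "fmt"

type syncState int

const (
	synced syncState = iota
	notSynced
)

const syncThreshold = 3 // layers, per the suggestion above

// initialState starts the node as synced after a restart and demotes it to
// notSynced only when the local mesh lags the network by more than the
// threshold; a smaller gap is backfilled by recursively fetching the
// missed layers.
func initialState(currentLayer, lastProcessed uint32) syncState {
	if currentLayer-lastProcessed > syncThreshold {
		return notSynced
	}
	return synced
}

func main() {
	fmt.Println(initialState(100, 98) == synced)    // short restart
	fmt.Println(initialState(100, 90) == notSynced) // long outage
}
```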

another issue is the gossipSynced state. the gossip handlers are gated on the gossipSynced state, while hare and the proposal builder require the synced state. does it make sense to participate in hare/tortoise in the gossipSynced state?

  2. if we switch into synced mode, we should also notify the tortoise that the missed layers are likely empty so that it doesn't vote abstain on them. i think that is ok

once the syncer compares state with neighbors and is convinced that everyone has the same state, yes.
what is the effect of "not vote abstain"? is it the same as "vote empty layer"?

just to double-check: when you say comparing state, you mean the accumulated layer hash (as opposed to the vm state)?

dshulyak commented:

the node's sync state is initialized to notSynced. maybe it could be initialized to synced and only changed to notSynced if it lags by more than the sync threshold (3)? the node should be able to recursively fetch the data missed due to the restart for those 3 layers.

better if the node just asks for the missing data without waiting for the fetcher to get it. also, the fetcher will not download all ballots, as they are not referenced (sketch below).
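
A hedged sketch of "just ask for the missing data"; the `fetchLayer` callback is a placeholder, not a real fetcher method. The point is to explicitly request each missed layer, including ballots that nothing references yet and that the regular fetcher would therefore skip.

```go
package main

import (
	"context"
	"fmt"
)

// backfill explicitly requests every layer missed during the restart
// instead of waiting for the periodic fetcher; fetchLayer stands in for
// whatever request the fetch layer exposes, and would need to cover
// ballots that nothing references yet.
func backfill(ctx context.Context, from, to uint32, fetchLayer func(context.Context, uint32) error) error {
	for l := from; l <= to; l++ {
		if err := fetchLayer(ctx, l); err != nil {
			return fmt.Errorf("backfill layer %d: %w", l, err)
		}
	}
	return nil
}

func main() {
	fetched := []uint32{}
	_ = backfill(context.Background(), 78, 80, func(_ context.Context, l uint32) error {
		fetched = append(fetched, l)
		return nil
	})
	fmt.Println(fetched) // [78 79 80]
}
```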

another issue is the gossipSynced state. the gossip handlers are gated on the gossipSynced state, while hare and the proposal builder require the synced state. does it make sense to participate in hare/tortoise in the gossipSynced state?

i think hare can participate immediately, as long as it has the beacon and the active set in consensus, so gossipSynced doesn't make much sense for it.
the tortoise can participate without causing harm if the missed layers are within zdist; we will just vote abstain until we get a certificate (see the sketch below)
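
A sketch of that participation rule with assumed names (illustrative only, not the tortoise code): voting is safe while the missed span fits within zdist, with abstain votes on the missed layers until their certificates arrive.

```go
package main

import "fmt"

// canVote reflects the rule above: a restarted node may start voting as
// long as everything it missed is still within zdist of the current layer.
func canVote(current, lastProcessed, zdist uint32) bool {
	return current-lastProcessed <= zdist
}

// voteOnMissed abstains on a missed layer until its certificate is seen,
// after which a concrete vote can be cast.
func voteOnMissed(hasCertificate bool) string {
	if hasCertificate {
		return "support"
	}
	return "abstain"
}

func main() {
	fmt.Println(canVote(105, 102, 4)) // true: gap of 3 within zdist
	fmt.Println(voteOnMissed(false))  // abstain until certificate arrives
}
```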

i am actually not sure when we even need the gossipSynced state. i think it is also not useful for atxs and the beacon.

once the syncer compares state with neighbors and is convinced that everyone has the same state, yes.
what is the effect of "not vote abstain"? is it the same as "vote empty layer"?

yes, it is the same. and yes, i meant the consensus (accumulated mesh) hash


dshulyak commented Aug 5, 2023

@countvonzero do you think we need the gossipSynced state? i think we should do something to enable fast restarts. given that we drop messages when a node is not synced, any urgent upgrade will halt the network for several layers, and in some configurations that will be extremely painful (like running several public nodes in front of smeshers)

countvonzero commented:

as is, i don't see value in gossipSync now that a separate state was created for listening to atx gossip.

the only thing in the code that needs to change if we remove the gossipSync state is to sync to the current layer instead of current - 1 while the node is syncing (see the sketch below).
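
The change described above as a one-function sketch (names are assumptions, not the actual syncer fields): with gossipSync removed, the syncer's target becomes the current layer rather than current - 1.

```go
package main

import "fmt"

// syncTarget shows the single behavioral change: today the syncer stops at
// current-1 and lets the gossipSync window cover the rest; with gossipSync
// removed it must sync all the way to the current layer.
func syncTarget(current uint32, gossipSyncRemoved bool) uint32 {
	if gossipSyncRemoved {
		return current
	}
	return current - 1
}

func main() {
	fmt.Println(syncTarget(100, false)) // 99: current behavior
	fmt.Println(syncTarget(100, true))  // 100: after removing gossipSync
}
```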

countvonzero commented:

alternatively, we could change the gossipSync window from 2 layers to 1 to be conservative (minimal code change)?


dshulyak commented Aug 6, 2023

alternatively, we could change the gossipSync window from 2 layers to 1 to be conservative (minimal code change)?

from my perspective, the most important part is to allow the node to gossip right after a short restart; participating in hare/tortoise is secondary. maybe we can adjust those filters in p2p as a low-risk change

countvonzero commented:

to allow the node to gossip right after a short restart

hmmm... there is some misunderstanding @dshulyak
this is already the case. when a node restarts, as soon as it gets p2p connections it enters the gossipSync state almost immediately, and that state already allows it to gossip data. the extra 2-layer wait only delays the following (see the sketch after this list):

  • hare
  • block proposal
  • beacon (if the timing is at the beginning of an epoch)
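
A sketch summarizing that gating (state names mirror the discussion; the layer thresholds are simplified): gossiping resumes at gossipSynced, while hare, block proposals, and beacon participation wait for the fully synced state.

```go
package main

import "fmt"

type syncState int

const (
	notSynced syncState = iota
	gossipSynced // reached almost immediately after p2p connections
	fullySynced  // reached ~2 layers later under the current config
)

// permissions summarizes the gating described above: gossiping data is
// allowed from gossipSynced on, while hare, block proposals, and beacon
// participation require the fully synced state.
func permissions(s syncState) (canGossip, canParticipate bool) {
	return s >= gossipSynced, s == fullySynced
}

func main() {
	g, p := permissions(gossipSynced)
	fmt.Println(g, p) // true false: gossip ok, hare/proposals still waiting
}
```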
