sync, tortoise: improve speed and efficiency of participation in protocol after restart #4504

Open
dshulyak opened this issue Jun 11, 2023 · 8 comments

dshulyak commented Jun 11, 2023

this pattern occurs when all nodes in the network are restarted. example:

  • layer 100 is verified, and everyone votes consistently on it
  • no ballots or blocks in layers 101-102
  • layers from 103 onward will have a block, but ballots in layers 103-105 will vote abstain on the previous layers, making them uncountable in verifying mode
  • after layer 105 all opinions should be consistent across the network, and after some number of layers we should be able to cross the pessimistic threshold

the downside is that the network has to switch into full mode to verify such ballots; with a long hdist this period will take quite some time and is definitely not optimal (the sketch below illustrates why abstaining ballots are uncountable in verifying mode)

it can be tested in system tests by failing all nodes (don't delete them, as that would also drop their state)
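
To illustrate, here is a minimal, self-contained Go sketch (an illustration, not the go-spacemesh implementation) of why abstaining ballots stall verification: an abstaining ballot contributes no weight to the layers it abstains on, so the counted weight cannot cross the pessimistic threshold until enough concrete votes accumulate.

```go
package main

import "fmt"

// ballot is a simplified stand-in for a tortoise ballot: it either casts
// a concrete vote on a layer or abstains on it. Names and numbers here
// are illustrative, not taken from go-spacemesh.
type ballot struct {
	weight  float64
	abstain bool
}

// countedWeight sums the weight of ballots with concrete votes; in
// verifying mode an abstaining ballot contributes nothing.
func countedWeight(ballots []ballot) float64 {
	var w float64
	for _, b := range ballots {
		if !b.abstain {
			w += b.weight
		}
	}
	return w
}

func main() {
	const threshold = 0.6   // hypothetical pessimistic threshold fraction
	const totalWeight = 30.0
	// Ballots from layers 103-105 in the example all abstain on 101-102,
	// so those layers accumulate no weight and cannot be verified yet.
	ballots := []ballot{{10, true}, {10, true}, {10, true}}
	fmt.Println(countedWeight(ballots) >= threshold*totalWeight) // false
}
```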

github-project-automation bot moved this to 📋 Backlog in Dev team kanban Jun 11, 2023
dshulyak moved this from 📋 Backlog to 🔖 Next in Dev team kanban Jun 11, 2023
dshulyak self-assigned this Jun 14, 2023
dshulyak moved this from 🔖 Next to 🏗 Doing in Dev team kanban Jun 14, 2023
dshulyak commented:

i ran a test to explore this behavior by launching 20 nodes and interrupting them all for 30s every 30m (chaos mesh template)

additionally, i set both the out-of-sync and gossip-sync thresholds to 0 (change). this is not important for tortoise behavior, but it helps the node get back into normal operation faster.

logs: consensus ballots

layer 78 is empty; this is where the nodes were restarted.
layer 79 is immediately successful, but its ballots abstain on 78, and due to limitations of the verifying tortoise algorithm we can't count such ballots.
a couple of layers later, the tortoise accumulated enough weight to cross the pessimistic threshold and made progress in verifying mode.

so the tortoise works as expected and didn't have to switch into full mode to make progress. however, there are other problems that stall network progress and that we should improve:

  1. sync and gossip-sync thresholds. we should find a better heuristic that allows nodes to participate in hare and tortoise as soon as possible after a restart. one such heuristic might be to ask peers for their opinions and, if they match our own, switch into synced immediately; in practice this should allow recovery within seconds (see the sketch after this list)
  2. if we switch into synced mode, we should also notify the tortoise that the missed layers are likely empty so that it doesn't vote abstain on them. i think that is ok
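
Below is a minimal sketch of the heuristic from point 1. The `peerOpinion` query and all names are assumptions for illustration, not existing go-spacemesh APIs: compare our consensus hash for the last processed layer against a sample of peers and, on agreement, switch straight to synced.

```go
package main

import "fmt"

// peerOpinion is a hypothetical query for a neighbor's consensus
// (accumulated mesh) hash at a given layer.
type peerOpinion func(layer uint32) (hash string, err error)

// matchesPeers returns true when every sampled peer reports the same hash
// we computed locally, i.e. the restart did not make us diverge and we can
// mark ourselves synced immediately instead of waiting out the thresholds.
func matchesPeers(layer uint32, localHash string, peers []peerOpinion) bool {
	if len(peers) == 0 {
		return false
	}
	for _, ask := range peers {
		h, err := ask(layer)
		if err != nil || h != localHash {
			return false
		}
	}
	return true
}

func main() {
	agree := func(uint32) (string, error) { return "0xabc", nil }
	fmt.Println(matchesPeers(78, "0xabc", []peerOpinion{agree, agree})) // true
}
```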

cc @countvonzero

dshulyak removed their assignment Jun 16, 2023
dshulyak moved this from 🏗 Doing to 🔖 Next in Dev team kanban Jun 16, 2023
dshulyak changed the title from "tortoise: don't switch into full mode after several layers of missing ballots and abstain votes" to "sync, tortoise: improve speed and efficiency of participation in protocol after restart" Jun 16, 2023
countvonzero commented:

thanks for sharing @dshulyak

  1. sync and gossip-sync thresholds ... in practice this should allow recovery within seconds

how does a node differentiate a network restart from a local one? even though the state may be the same as its neighbors', it most likely missed some gossip messages.

the node's sync state is initialized to notSynced. maybe it could be initialized to synced and only changed to notSynced if it lags by more than the sync threshold (3)? the node should be able to recursively fetch the data missed due to the restart for those 3 layers (a minimal sketch follows).
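
A minimal sketch of this suggestion, with assumed names (the real syncer state machine differs): start optimistic after a restart and demote to notSynced only when the gap exceeds the threshold.

```go
package main

import "fmt"

type syncState int

const (
	synced syncState = iota
	notSynced
)

const syncThreshold = 3 // layers, per the suggestion above

// initialState starts the node as synced after a restart and demotes it to
// notSynced only when the local mesh lags the network by more than the
// threshold; a smaller gap is backfilled by recursively fetching the
// missed layers.
func initialState(currentLayer, lastProcessed uint32) syncState {
	if currentLayer-lastProcessed > syncThreshold {
		return notSynced
	}
	return synced
}

func main() {
	fmt.Println(initialState(100, 98) == synced)    // short restart
	fmt.Println(initialState(100, 90) == notSynced) // long outage
}
```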

another issue is the gossipSynced state. the gossip handlers are gated on the gossipSynced state, while hare and the proposal builder require the synced state. does it make sense to participate in hare/tortoise in the gossipSynced state?

  2. if we switch into synced mode, we should also notify the tortoise that the missed layers are likely empty so that it doesn't vote abstain on them. i think that is ok

once the syncer compares state with neighbors and is convinced that everyone has the same state, yes.
what is the effect of "not vote abstain"? is it the same as "vote empty layer"?

just to double-check: when you say comparing state, you mean the accumulated layer hash (as opposed to the vm state)?

dshulyak commented:

the node's sync state is initialized to notSynced. maybe it could be initialized to synced and only changed to notSynced if it lags by more than the sync threshold (3)? the node should be able to recursively fetch the data missed due to the restart for those 3 layers.

better if the node just asks for the missing data without waiting for the fetcher to get it. also, the fetcher will not download all ballots, as they are not referenced (sketch below).
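
A hedged sketch of "just ask for the missing data"; the `fetchLayer` callback is a placeholder, not a real fetcher method. The point is to explicitly request each missed layer, including ballots that nothing references yet and that the regular fetcher would therefore skip.

```go
package main

import (
	"context"
	"fmt"
)

// backfill explicitly requests every layer missed during the restart
// instead of waiting for the periodic fetcher; fetchLayer stands in for
// whatever request the fetch layer exposes, and would need to cover
// ballots that nothing references yet.
func backfill(ctx context.Context, from, to uint32, fetchLayer func(context.Context, uint32) error) error {
	for l := from; l <= to; l++ {
		if err := fetchLayer(ctx, l); err != nil {
			return fmt.Errorf("backfill layer %d: %w", l, err)
		}
	}
	return nil
}

func main() {
	fetched := []uint32{}
	_ = backfill(context.Background(), 78, 80, func(_ context.Context, l uint32) error {
		fetched = append(fetched, l)
		return nil
	})
	fmt.Println(fetched) // [78 79 80]
}
```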

another issue is the gossipSynced state. the gossip handlers are gated on the gossipSynced state, while hare and the proposal builder require the synced state. does it make sense to participate in hare/tortoise in the gossipSynced state?

i think hare can participate immediately, as long as it has the beacon and the active set in consensus, so gossipSynced doesn't make much sense for it.
the tortoise can participate without causing harm if the missed layers are within zdist; we will just vote abstain until we get a certificate (see the sketch below)
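
A sketch of that participation rule with assumed names (illustrative only, not the tortoise code): voting is safe while the missed span fits within zdist, with abstain votes on the missed layers until their certificates arrive.

```go
package main

import "fmt"

// canVote reflects the rule above: a restarted node may start voting as
// long as everything it missed is still within zdist of the current layer.
func canVote(current, lastProcessed, zdist uint32) bool {
	return current-lastProcessed <= zdist
}

// voteOnMissed abstains on a missed layer until its certificate is seen,
// after which a concrete vote can be cast.
func voteOnMissed(hasCertificate bool) string {
	if hasCertificate {
		return "support"
	}
	return "abstain"
}

func main() {
	fmt.Println(canVote(105, 102, 4)) // true: gap of 3 within zdist
	fmt.Println(voteOnMissed(false))  // abstain until certificate arrives
}
```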

i am actually not sure when we even need the gossipSynced state. i think it is also not useful for atxs and the beacon.

once the syncer compares state with neighbors and is convinced that everyone has the same state, yes.
what is the effect of "not vote abstain"? is it the same as "vote empty layer"?

yes, it is the same. and yes, i meant the consensus (accumulated mesh) hash


dshulyak commented Aug 5, 2023

@countvonzero do you think we need the gossipSynced state? i think we should do something to enable fast restarts. given that we drop messages when a node is not synced, any urgent upgrade will halt the network for several layers, and in some configurations that will be extremely painful (like running several public nodes in front of smeshers)

countvonzero commented:

as is, i don't see value in gossipSync now that a separate state was created for listening to atx gossip.

the only thing in the code that needs to change if we remove the gossipSync state is to sync to the current layer instead of current - 1 while the node is syncing (see the sketch below).
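
The change described above as a one-function sketch (names are assumptions, not the actual syncer fields): with gossipSync removed, the syncer's target becomes the current layer rather than current - 1.

```go
package main

import "fmt"

// syncTarget shows the single behavioral change: today the syncer stops at
// current-1 and lets the gossipSync window cover the rest; with gossipSync
// removed it must sync all the way to the current layer.
func syncTarget(current uint32, gossipSyncRemoved bool) uint32 {
	if gossipSyncRemoved {
		return current
	}
	return current - 1
}

func main() {
	fmt.Println(syncTarget(100, false)) // 99: current behavior
	fmt.Println(syncTarget(100, true))  // 100: after removing gossipSync
}
```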

countvonzero commented:

alternatively, we could change the gossipSync window from 2 layers to 1 to be conservative (minimal code change)?


dshulyak commented Aug 6, 2023

alternatively, we could change the gossipSync window from 2 layers to 1 to be conservative (minimal code change)?

from my perspective, the most important part is to allow the node to gossip right after a short restart; participating in hare/tortoise is secondary. maybe we can adjust those filters in p2p as a low-risk change

countvonzero commented:

to allow the node to gossip right after a short restart

hmmm... there is some misunderstanding @dshulyak
this is already the case. when a node restarts, as soon as it gets p2p connections it enters the gossipSync state almost immediately, and that state already allows it to gossip data. the extra 2-layer wait only delays the following (see the sketch after this list):

  • hare
  • block proposal
  • beacon (if the timing is at the beginning of an epoch)
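
A sketch summarizing that gating (state names mirror the discussion; the layer thresholds are simplified): gossiping resumes at gossipSynced, while hare, block proposals, and beacon participation wait for the fully synced state.

```go
package main

import "fmt"

type syncState int

const (
	notSynced syncState = iota
	gossipSynced // reached almost immediately after p2p connections
	fullySynced  // reached ~2 layers later under the current config
)

// permissions summarizes the gating described above: gossiping data is
// allowed from gossipSynced on, while hare, block proposals, and beacon
// participation require the fully synced state.
func permissions(s syncState) (canGossip, canParticipate bool) {
	return s >= gossipSynced, s == fullySynced
}

func main() {
	g, p := permissions(gossipSynced)
	fmt.Println(g, p) // true false: gossip ok, hare/proposals still waiting
}
```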
