Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Sync module has racing condition on some checks #2978

Closed
HarukaMa opened this issue Jan 8, 2024 · 13 comments · Fixed by #3321
Closed

[Bug] Sync module has racing condition on some checks #2978

HarukaMa opened this issue Jan 8, 2024 · 13 comments · Fixed by #3321
Labels
bug Incorrect or unexpected behavior

Comments

@HarukaMa
Copy link
Contributor

HarukaMa commented Jan 8, 2024

🐛 Bug Report

The log (with reduced sync and heartbeat intervals):


2024-01-08T17:30:46.360076Z DEBUG snarkos_node_router::handshake: Connecting to 35.224.50.150:4133...
2024-01-08T17:30:46.498847Z  INFO snarkos_node_router::handshake: Connected to '35.224.50.150:4133'
2024-01-08T17:30:47.318394Z TRACE snarkos_node_sync::block_sync: Updating is_block_synced: greatest_peer_height = 1138544, canon_height = 1127749
2024-01-08T17:30:47.320073Z TRACE snarkos_node_sync::block_sync: Prepared 50 block requests
2024-01-08T17:30:53.013389Z TRACE snarkos_node_sync::block_sync: Updating is_block_synced: greatest_peer_height = 1138544, canon_height = 1127750
2024-01-08T17:30:53.015090Z TRACE snarkos_node_sync::block_sync: Prepared 1 block requests
2024-01-08T17:30:53.023921Z  INFO snarkos_node_bft_ledger_service::ledger:

Advanced to block 1127750 at round 2352850 - ab1l7l6x3jej9fzdqwecs96lp6m9sqpverqg4uwkpmgha2r0fjeeszs40f346

2024-01-08T17:30:58.089332Z TRACE snarkos_node_sync::block_sync: Updating is_block_synced: greatest_peer_height = 1138544, canon_height = 1127751
2024-01-08T17:30:58.091648Z TRACE snarkos_node_sync::block_sync: Prepared 1 block requests
2024-01-08T17:30:58.091869Z DEBUG snarkos_node_router::heartbeat: Connected to 1 peer: [35.224.50.150:4133]
2024-01-08T17:30:58.099467Z  INFO snarkos_node_bft_ledger_service::ledger:

Advanced to block 1127751 at round 2352852 - ab1mq3v8ypsrfepjh35t0thje9l9mv3djm5ls8negr589mkkpw5mqrsxe73qs

2024-01-08T17:31:01.103764Z TRACE snarkos_node_sync::block_sync: Updating is_block_synced: greatest_peer_height = 1138544, canon_height = 1127751
2024-01-08T17:31:01.104716Z TRACE snarkos_node_sync::block_sync: Prepared 1 block requests
2024-01-08T17:31:04.119788Z TRACE snarkos_node_sync::block_sync: Updating is_block_synced: greatest_peer_height = 1138544, canon_height = 1127751
2024-01-08T17:31:04.120722Z TRACE snarkos_node_sync::block_sync: Prepared 47 block requests
2024-01-08T17:31:07.695287Z TRACE snarkos_node_sync::block_sync: Updating is_block_synced: greatest_peer_height = 1138544, canon_height = 1127751
2024-01-08T17:31:07.696388Z TRACE snarkos_node_sync::block_sync: Prepared 0 block requests
2024-01-08T17:31:08.092324Z DEBUG snarkos_node_router::heartbeat: Connected to 1 peer: [35.224.50.150:4133]
2024-01-08T17:31:10.698087Z TRACE snarkos_node_sync::block_sync: Updating is_block_synced: greatest_peer_height = 1138544, canon_height = 1127751
2024-01-08T17:31:10.699059Z TRACE snarkos_node_sync::block_sync: Prepared 1 block requests
2024-01-08T17:31:14.683545Z TRACE snarkos_node_sync::block_sync: Updating is_block_synced: greatest_peer_height = 1138544, canon_height = 1127752
2024-01-08T17:31:14.685213Z TRACE snarkos_node_sync::block_sync: Prepared 2 block requests
2024-01-08T17:31:14.705284Z  INFO snarkos_node_bft_ledger_service::ledger:

Advanced to block 1127752 at round 2352854 - ab18upqaram79ewsruj2rgce0v3gp9kvmxcvv9atvy9a5vrquas9yxqgpguut

2024-01-08T17:31:19.283766Z TRACE snarkos_node_sync::block_sync: Updating is_block_synced: greatest_peer_height = 1138544, canon_height = 1127753
2024-01-08T17:31:19.284845Z TRACE snarkos_node_sync::block_sync: Prepared 1 block requests
2024-01-08T17:31:19.284983Z DEBUG snarkos_node_router::heartbeat: Connected to 1 peer: [35.224.50.150:4133]
2024-01-08T17:31:19.294601Z  INFO snarkos_node_bft_ledger_service::ledger:
Advanced to block 1127753 at round 2352856 - ab1pjjhfy26khq6rlywvvknfxy4d4x3c7r6kckats8q7nw0zvhjpq9s8qqhs7

2024-01-08T17:31:24.324350Z TRACE snarkos_node_sync::block_sync: Updating is_block_synced: greatest_peer_height = 0, canon_height = 1127754
2024-01-08T17:31:24.324450Z TRACE snarkos_node_sync::block_sync: Prepared 0 block requests
2024-01-08T17:31:24.333724Z  INFO snarkos_node_bft_ledger_service::ledger:

Advanced to block 1127754 at round 2352858 - ab1gt0msxch423xswydku6axhx4dx8knem5jdlf4pkw20wa5gf5t5zsye4j8g

2024-01-08T17:31:24.717928Z TRACE snarkos_node_sync::block_sync: Block sync is removing all block requests to peer 35.224.50.150:4133...
2024-01-08T17:31:24.718075Z  WARN snarkos_node::client::router: The sync pool did not request block 1127755
2024-01-08T17:31:24.718286Z  WARN snarkos_node::client::router: Disconnecting from '35.224.50.150:4133' - Peer '35.224.50.150:4133' sent an invalid block response
2024-01-08T17:31:24.718746Z TRACE snarkos_node_sync::block_sync: Block sync is removing all block requests to peer 35.224.50.150:4133...

I believe this should not happen even with reduced intervals as advancing blocks is pretty slow right now.

Steps to Reproduce

Try to sync. Eventually it will happen. Best observed when there is only 1 connection.

Expected Behavior

This should not happen.

Your Environment

  • testnet3
@HarukaMa HarukaMa added the bug Incorrect or unexpected behavior label Jan 8, 2024
@joske
Copy link
Contributor

joske commented Jan 9, 2024

Hi @HarukaMa, could you reproduce with --verbosity 5?

@joske
Copy link
Contributor

joske commented Jan 10, 2024

As far as I understand, this is the order of events:

  • we are syncing with the only peer, we have sent out BlockRequests up until 1127802 (I think, from calculating the number of sent requests - max 50)
  • we receive BlockResponse 1127755
  • for some unknown reason, the matching BlockRequest is no longer in the map self.requests
  • check_block_response returns The sync pool did not request block 1127755
  • this triggers the removal of all requests to this peer and then the disconnect of the (only) peer with the generic message "invalid block response"

No idea why the request for 1127755 would be gone from the map.

I've not been able to reproduce this so far

@HarukaMa
Copy link
Contributor Author

HarukaMa commented Jan 10, 2024

I think the event timeline is like this:

The log immediately before shows that this is very possible (I should have added more logs to pinpoint but I think this is plausible enough):

2024-01-10T18:15:31.532539Z TRACE snarkos_node_sync::block_sync: Updating is_block_synced: greatest_peer_height = 0, canon_height = 1150881
2024-01-10T18:15:31.532605Z TRACE snarkos_node_sync::block_sync: Prepared 0 block requests
2024-01-10T18:15:31.549565Z  INFO snarkos_node_bft_ledger_service::ledger:

Advanced to block 1150881 at round 2405730 - ab1ehcn4swnqdn7gqmfpg50w2l8d6rtea6zvz6mvkmaqua82d8q8sfq3rawhv

2024-01-10T18:15:31.549926Z TRACE snarkos_node_router::inbound: Received 'BlockResponse 1150882' from '147.192.99.28:4133'
2024-01-10T18:15:32.041286Z TRACE snarkos_node_sync::block_sync: Block sync is removing all block requests to peer 147.192.99.28:4133...
2024-01-10T18:15:32.041398Z  WARN snarkos_node::client::router: The sync pool did not request block 1150882

the sync job most likely treated the existing requests as timed out and removed them, so when the previous block has been added and the message processing job is trying to process the next response, it will think that the next response wasn't requested.

@HarukaMa
Copy link
Contributor Author

well, tried to add more logs and here it is:

2024-01-10T19:06:09.582771Z ERROR snarkos_node_sync::block_sync: Removed 44 timed out requests: [({147.192.99.28:4133}, 1150896), ({147.192.99.28:4133}, 1150897), ({147.192.99.28:4133}, 1150898), ({147.192.99.28:4133}, 1150899), ({147.192.99.28:4133}, 1150900), ({147.192.99.28:4133}, 1150901), ({147.192.99.28:4133}, 1150902), ({147.192.99.28:4133}, 1150903), ({147.192.99.28:4133}, 1150904), ({147.192.99.28:4133}, 1150905), ({147.192.99.28:4133}, 1150906), ({147.192.99.28:4133}, 1150907), ({147.192.99.28:4133}, 1150908), ({147.192.99.28:4133}, 1150909), ({147.192.99.28:4133}, 1150910), ({147.192.99.28:4133}, 1150911), ({147.192.99.28:4133}, 1150912), ({147.192.99.28:4133}, 1150913), ({147.192.99.28:4133}, 1150914), ({147.192.99.28:4133}, 1150915), ({147.192.99.28:4133}, 1150916), ({147.192.99.28:4133}, 1150917), ({147.192.99.28:4133}, 1150918), ({147.192.99.28:4133}, 1150919), ({147.192.99.28:4133}, 1150920), ({147.192.99.28:4133}, 1150921), ({147.192.99.28:4133}, 1150922), ({147.192.99.28:4133}, 1150923), ({147.192.99.28:4133}, 1150924), ({147.192.99.28:4133}, 1150925), ({147.192.99.28:4133}, 1150926), ({147.192.99.28:4133}, 1150927), ({147.192.99.28:4133}, 1150928), ({147.192.99.28:4133}, 1150929), ({147.192.99.28:4133}, 1150930), ({147.192.99.28:4133}, 1150931), ({147.192.99.28:4133}, 1150932), ({147.192.99.28:4133}, 1150933), ({147.192.99.28:4133}, 1150934), ({147.192.99.28:4133}, 1150935), ({147.192.99.28:4133}, 1150936), ({147.192.99.28:4133}, 1150937), ({147.192.99.28:4133}, 1150940), ({147.192.99.28:4133}, 1150941)]
2024-01-10T19:06:11.425595Z TRACE snarkos_node_sync::block_sync: Updating is_block_synced: greatest_peer_height = 0, canon_height = 1150895
2024-01-10T19:06:11.425636Z TRACE snarkos_node_sync::block_sync: Prepared 0 block requests
2024-01-10T19:06:11.434921Z  INFO snarkos_node_bft_ledger_service::ledger:

Advanced to block 1150895 at round 2405762 - ab1r6q96ansh5g0xx2lrnymegsnefws8wuh9ak9ysdk6lx27nhjdugq67xq3s

2024-01-10T19:06:11.435120Z TRACE snarkos_node_router::inbound: Received 'BlockResponse 1150896' from '147.192.99.28:4133'
2024-01-10T19:06:11.551574Z TRACE snarkos_node_sync::block_sync: Block sync is removing all block requests to peer 147.192.99.28:4133...
2024-01-10T19:06:11.551691Z  WARN snarkos_node::client::router: The sync pool did not request block 1150896; current outstanding requests list is {1150938: (None, None, {147.192.99.28:4133}), 1150939: (None, None, {147.192.99.28:4133}), 1150942: (None, None, {147.192.99.28:4133}), 1150943: (None, None, {147.192.99.28:4133}), 1150944: (None, None, {147.192.99.28:4133})}
2024-01-10T19:06:11.551907Z  WARN snarkos_node::client::router: Disconnecting from '147.192.99.28:4133' - Peer '147.192.99.28:4133' sent an invalid block response

@HarukaMa
Copy link
Contributor Author

HarukaMa commented Jan 10, 2024

I don't think there is a good way to fix this without refactoring how the incoming block response messages are handled except allowing requesting multiple blocks in one message:
https://github.com/AleoHQ/snarkOS/blob/bf18a30b620f7288efe9ccf327c1c9bc80851617/node/bft/events/src/block_response.rs#L67

we are just spamming multiple BlockRequests when we are trying to sync more than one blocks anyway. But we probably need a separate spam prevention mechanism for that.

@joske
Copy link
Contributor

joske commented Jan 11, 2024

Ah it's the timeout mechanism that removes the request from the map.

A partial fix could be to bump the timestamps of all other BlockRequests when a BlockResponse is successfully handled and/or increase the timeout a bit.

@HarukaMa
Copy link
Contributor Author

The issue might be BlockResponses are handled in a blocking way, so they could be actually "handled" after the message is received for quite a while later.

Actually, I found that when there are multiple sync nodes, the node will try to sync different blocks from different nodes, so increasing max number of blocks in request might not work as well.

@joske
Copy link
Contributor

joske commented Jan 11, 2024

Yes, I meant when the BlockResponse is handled. I've implemented this idea in

https://github.com/eqlabs/snarkOS/tree/fix/sync-timeout

Could you try this out?

@HarukaMa
Copy link
Contributor Author

oh, you meant all other requests, then it might work. lemme try to manually apply that patch

@joske
Copy link
Contributor

joske commented Jan 11, 2024

It's not foolproof, but I think this can help alleviate the problem.

@HarukaMa
Copy link
Contributor Author

Can confirm that there is no more timeouts and disconnections after applying the patch.

@joske
Copy link
Contributor

joske commented Jan 11, 2024

I'll open a PR with this fix then.

@raychu86
Copy link
Contributor

Please see if the issues are still reproducible with the new changes in #3079.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect or unexpected behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants