[Bug] Sync module has racing condition on some checks #2978
Comments
Hi @HarukaMa, could you reproduce with
As far as I understand, this is the order of events:
No idea why the request for 1127755 would be gone from the map. I've not been able to reproduce this so far.
I think the event timeline is like this:
The log immediately before shows that this is very possible (I should have added more logs to pinpoint but I think this is plausible enough):
The sync job most likely treated the existing requests as timed out and removed them, so when the previous block has been added and the message processing job tries to process the next response, it thinks that the next response wasn't requested.
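The timeline described above can be replayed in a small model. This is only a sketch under stated assumptions: `PendingRequests`, the 15-second timeout, and the block heights 100 and 101 are hypothetical and are not taken from the snarkOS source. Timestamps are passed in explicitly so that the interleaving is deterministic.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the sync module's map of outstanding
/// block requests: block height -> time the request was sent.
struct PendingRequests {
    sent_at: BTreeMap<u32, Instant>,
    timeout: Duration,
}

impl PendingRequests {
    fn new(timeout: Duration) -> Self {
        Self { sent_at: BTreeMap::new(), timeout }
    }

    /// Records that a block request for `height` was sent at `now`.
    fn request(&mut self, height: u32, now: Instant) {
        self.sent_at.insert(height, now);
    }

    /// Models the sync job: drops every request older than the timeout.
    fn remove_timed_out(&mut self, now: Instant) {
        let timeout = self.timeout;
        self.sent_at.retain(|_, sent| now.duration_since(*sent) < timeout);
    }

    /// Models the message processing job: a response is accepted only
    /// if a matching request is still in the map.
    fn handle_response(&mut self, height: u32) -> Result<(), String> {
        match self.sent_at.remove(&height) {
            Some(_) => Ok(()),
            None => Err(format!("block {} was not requested", height)),
        }
    }
}

/// Replays the suspected interleaving and returns the outcome of
/// handling the two responses.
fn replay_race() -> (Result<(), String>, Result<(), String>) {
    let t0 = Instant::now();
    let mut pending = PendingRequests::new(Duration::from_secs(15)); // assumed timeout

    // Two requests are sent back to back.
    pending.request(100, t0);
    pending.request(101, t0);

    // The response for 100 is accepted; adding that block is slow.
    let first = pending.handle_response(100);

    // Meanwhile the sync job runs at t0 + 16s and sees the request
    // for 101 as timed out, so it removes it from the map.
    pending.remove_timed_out(t0 + Duration::from_secs(16));

    // The response for 101, queued behind 100, is processed only now.
    let second = pending.handle_response(101);
    (first, second)
}
```

Calling `replay_race()` yields an accepted first response and a rejected second one, even though the peer answered both requests.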
well, tried to add more logs and here it is:
I don't think there is a good way to fix this without refactoring how the incoming block response messages are handled, other than allowing multiple blocks to be requested in one message: we are just spamming multiple
Ah, it's the timeout mechanism that removes the request from the map. A partial fix could be to bump the timestamps of all other BlockRequests when a BlockResponse is successfully handled and/or increase the timeout a bit.
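A minimal sketch of the timestamp-bumping idea, assuming a simple height-to-timestamp map. `BlockRequests`, `on_block_response`, and the 15-second timeout are hypothetical names and values chosen for illustration; this is not the code from the actual patch.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Hypothetical map of outstanding block requests:
/// block height -> time the request was sent (or last refreshed).
struct BlockRequests {
    sent_at: BTreeMap<u32, Instant>,
    timeout: Duration,
}

impl BlockRequests {
    fn new(timeout: Duration) -> Self {
        Self { sent_at: BTreeMap::new(), timeout }
    }

    /// Records that a block request for `height` was sent at `now`.
    fn insert(&mut self, height: u32, now: Instant) {
        self.sent_at.insert(height, now);
    }

    /// Handles a response. On success, every other outstanding request
    /// gets its timestamp bumped to `now`, because the peer has just
    /// proven that it is still responding.
    fn on_block_response(&mut self, height: u32, now: Instant) -> bool {
        if self.sent_at.remove(&height).is_none() {
            return false; // response for a block that was not requested
        }
        for sent in self.sent_at.values_mut() {
            *sent = now;
        }
        true
    }

    /// Drops requests older than the timeout; returns how many were dropped.
    fn remove_timed_out(&mut self, now: Instant) -> usize {
        let before = self.sent_at.len();
        let timeout = self.timeout;
        self.sent_at.retain(|_, sent| now.duration_since(*sent) < timeout);
        before - self.sent_at.len()
    }
}

/// Replays a slow sync with the bump in place. Returns whether the first
/// response was accepted, how many requests the timeout pass dropped, and
/// whether the second response was accepted.
fn replay_with_bump() -> (bool, usize, bool) {
    let t0 = Instant::now();
    let mut requests = BlockRequests::new(Duration::from_secs(15)); // assumed timeout
    requests.insert(100, t0);
    requests.insert(101, t0);

    // The response for 100 is handled at t0 + 10s, which refreshes 101.
    let first = requests.on_block_response(100, t0 + Duration::from_secs(10));

    // The timeout pass at t0 + 16s now sees 101 as only 6s old.
    let dropped = requests.remove_timed_out(t0 + Duration::from_secs(16));

    let second = requests.on_block_response(101, t0 + Duration::from_secs(17));
    (first, dropped, second)
}
```

The bump only helps while responses keep arriving. If a single block takes longer than the timeout to process, the remaining requests can still expire, so this narrows the window rather than closing it.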
The issue might be Actually, I found that when there are multiple sync nodes, the node will try to sync different blocks from different nodes, so increasing the max number of blocks per request might not work either.
Yes, I meant when the BlockResponse is handled. I've implemented this idea in https://github.com/eqlabs/snarkOS/tree/fix/sync-timeout Could you try this out?
oh, you meant all other requests, then it might work. lemme try to manually apply that patch
It's not foolproof, but I think this can help alleviate the problem.
Can confirm that there are no more timeouts or disconnections after applying the patch.
I'll open a PR with this fix then.
Please see if the issues are still reproducible with the new changes in #3079.
🐛 Bug Report
The log (with reduced sync and heartbeat intervals):
I believe this should not happen even with reduced intervals as advancing blocks is pretty slow right now.
Steps to Reproduce
Try to sync. Eventually it will happen. Best observed when there is only 1 connection.
Expected Behavior
This should not happen.
Your Environment