sync should not turn off consensus #5036
Comments
logfile for "out of sync" node: |
I think this is a big problem. Sync should be very conservative about when it can interrupt consensus; in fact, it should almost never interrupt it. The current heuristic of 3 layers is no good, or something else makes it not work.
Let's change it so the node is considered not synced only if it was offline (0 peers reported by libp2p) for 30 minutes. In all other cases sync should not interrupt consensus.
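A minimal sketch of that proposed rule, in Go, assuming a periodic callback that receives the peer count reported by libp2p; the offlineTracker type and all names here are hypothetical, not taken from go-spacemesh:

```go
package syncer

import "time"

// offlineGrace is the proposed window: sync should treat the node as not
// synced only after it has had zero libp2p peers for this long.
const offlineGrace = 30 * time.Minute

// offlineTracker is a hypothetical holder for the offline timer; it is not
// a real go-spacemesh type.
type offlineTracker struct {
	offlineSince time.Time // zero value means the node currently has peers
}

// observePeers is called periodically with the peer count reported by libp2p.
func (t *offlineTracker) observePeers(peerCount int, now time.Time) {
	switch {
	case peerCount > 0:
		t.offlineSince = time.Time{} // back online, reset the timer
	case t.offlineSince.IsZero():
		t.offlineSince = now // first observation with zero peers
	}
}

// notSynced reports whether sync may interrupt consensus under the proposed
// rule: only when the node has been offline for the full grace period.
func (t *offlineTracker) notSynced(now time.Time) bool {
	return !t.offlineSince.IsZero() && now.Sub(t.offlineSince) >= offlineGrace
}
```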
I noticed this behavior as well (before the patch, on v1.1.6). Do we have any idea why this might have been the case?
I think that's because ATX sync queries are slow, and that caused sync to be delayed and fall out of the expected window. Below you can see that the problems stopped when the epoch started and ATX sync stopped. The private node queries the public one, and all of its queries are fast, but the public node queries random peers on the network and responses are unpredictable. I think this just highlighted that we should not have such risky heuristics.
For the record, this is what @tal-m said about what is required in order for a node to participate in consensus:
I was advocating for something similar in #4504 (comment), but I think Hare can have more relaxed conditions: unlike Tortoise, Hare can't vote against, so it is the same as not voting at all. I think we will remove the IsSynced condition altogether and rely on specific data being available. But what is more important is that:
Likely the same problem as #4977.
Commit referencing this issue (#5143):
closes: #5127 #5036
Peers that are overwhelmed will generally not be used for requests. There are two criteria used to select a good peer:
- request success rate: success rates within 0.1 (10%) of each other are treated as equal, and in that case latency is used
- latency: the hs/1 protocol is used to track latency, as it is the most used protocol and the objects served over it are of the same size, with several exceptions (active sets, lists of malfeasance proofs)
related: #4977
It also limits the number of peers that ATX data is requested from; previously we were requesting data from all peers at least once. Synced the data 2 times in 90 minutes, whereas the previous attempt on my computer was 1 week ago and took 12 hours.
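A rough sketch of the selection rule that commit message describes: success rates within 0.1 of each other count as equal, and latency breaks the tie. The peerStats type and field names below are illustrative stand-ins, not the real go-spacemesh fetch types:

```go
package fetch

import "time"

const rateTolerance = 0.1 // success rates within 10% are considered equal

// peerStats is an illustrative stand-in for per-peer request statistics.
type peerStats struct {
	successRate float64       // fraction of successful requests
	latency     time.Duration // e.g. a moving average of hs/1 response times
}

// better reports whether peer a should be preferred over peer b:
// pick the clearly higher success rate, otherwise the lower latency.
func better(a, b peerStats) bool {
	if a.successRate-b.successRate > rateTolerance {
		return true
	}
	if b.successRate-a.successRate > rateTolerance {
		return false
	}
	return a.latency < b.latency // rates are "equal" within tolerance
}
```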
Description
I'm running two nodes (both v1.1.6); both see exactly the same layer data, but one thinks it's synced and one doesn't:
The one that thinks it's out of sync keeps printing:
2023-09-19T12:17:50.296-0400 INFO abcde.sync node is too far behind {"node_id": "abcde", "module": "sync", "sessionId": "06dadc41-5e8b-4f6a-8e67-7baff90bb12c", "current": "19395", "last synced": "19389", "behind threshold": 3, "name": "sync"}
The only difference between them is that the "in sync" node is a private, local peer connected only to the other node, and the "out of sync" node is a public node with 40-50 peers. Unclear if this is related.
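For illustration, here is a minimal sketch of the kind of threshold check the log line suggests; the function name and the exact comparison are assumptions, not the real syncer code. With current=19395, last synced=19389, and a threshold of 3, the gap of 6 layers trips the check:

```go
package main

import "fmt"

// tooFarBehind mirrors the check implied by the "node is too far behind"
// log line: the gap between the current layer and the last synced layer
// exceeds the configured threshold. The real syncer may compare differently;
// this is only an illustration.
func tooFarBehind(current, lastSynced, threshold uint32) bool {
	return current > lastSynced && current-lastSynced > threshold
}

func main() {
	// Values taken from the log line in the description above.
	fmt.Println(tooFarBehind(19395, 19389, 3)) // true: a gap of 6 layers
}
```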