Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync should not turn off consensus #5036

Closed
lrettig opened this issue Sep 19, 2023 · 8 comments
Closed

sync should not turn off consensus #5036

lrettig opened this issue Sep 19, 2023 · 8 comments
Assignees

Comments

@lrettig
Copy link
Member

lrettig commented Sep 19, 2023

Description

I'm running two nodes (both v1.1.6), both see exactly the same layer data, but one thinks it's synced and one doesn't:

> grpcurl -plaintext localhost:9092 spacemesh.v1.NodeService.Status
{
  "status": {
    "connectedPeers": "44",
    "syncedLayer": {
      "number": 19386
    },
    "topLayer": {
      "number": 19395
    },
    "verifiedLayer": {
      "number": 19385
    }
  }
}
> grpcurl -plaintext localhost:9192 spacemesh.v1.NodeService.Status
{
  "status": {
    "connectedPeers": "1",
    "isSynced": true,
    "syncedLayer": {
      "number": 19386
    },
    "topLayer": {
      "number": 19395
    },
    "verifiedLayer": {
      "number": 19385
    }
  }
}

The one that thinks it's out of sync keeps printing:

2023-09-19T12:17:50.296-0400 INFO abcde.sync node is too far behind {"node_id": "abcde", "module": "sync", "sessionId": "06dadc41-5e8b-4f6a-8e67-7baff90bb12c", "current": "19395", "last synced": "19389", "behind threshold": 3, "name": "sync"}

The only difference between them is that the "in sync" node is a private, local peer connected only to the other node, and the "out of sync" node is a public node with 40-50 peers. Unclear if this is related.

@lrettig lrettig added the bug label Sep 19, 2023
@lrettig
Copy link
Member Author

lrettig commented Sep 19, 2023

logfile for "out of sync" node:

log.txt.gz

@dshulyak
Copy link
Contributor

dshulyak commented Sep 19, 2023

i think this is a big problem. sync should be very conservative when it can interrupt consensus. in fact it should almost never interrupt it. current heuristic of 3 layers is no good, or something else makes it so that it doesn't work.
in latest outage it makes some nodes change sync state consistently.

image

@dshulyak
Copy link
Contributor

lets change to not synced only if node was offline (0 peers reported by libp2p) for 30 minutes. in all other cases sync should not interrupt consensus

@dshulyak dshulyak changed the title isSynced doesn't seem to depend on syncedLayer sync should not turn off consensus Sep 20, 2023
@lrettig
Copy link
Member Author

lrettig commented Sep 20, 2023

looks like all private nodes are in sync while public ones aren't (https://discord.com/channels/623195163510046732/1141739980306272427/1153806007785496698, also: https://discord.com/channels/623195163510046732/1141739980306272427/1153911699691282482)

I noticed this behavior as well (before the patch, on v1.1.6). do we have any idea why this might've been the case?

bors bot pushed a commit that referenced this issue Sep 21, 2023
…5040)

related: #5036

in future we should drop it completely, and use only connectivity information to decide if node should stop participating
in consensus. there should be no risk of interrupting consensus, because of any unexpected failures in sync process.
@dshulyak
Copy link
Contributor

i think thats because atx sync queries are slow, and it caused sync to be delayed and fall out of expected window. below you can see that problems stopped when epoch started, and atx sync stopped.

private query public and all queries are fast, but public queries random peers on the network and responses are unpredictable. i think this just highlighted that we should not have such risky heuristics.

image

@lrettig
Copy link
Member Author

lrettig commented Sep 22, 2023

For the record this is what @tal-m said - in order for a node to participate in consensus:

It definitely needs to have timing information for the Hare rounds in the current instance, and IIRC for voting on proposals it needs information about the current state and the recent transactions in the mempool (to compute "conservative balance").
For generating ballots it needs information about the recent Hare instances, and if there are layers older than hdist that haven't passed the confidence threshold it also needs all the most recent ballots.
Of course, for doing anything it needs to be able to evaluate eligibility, so it needs to know the active set and any malfeasance proofs.

@dshulyak
Copy link
Contributor

i was advocating for something similar in this topic #4504 (comment)

but i think think hare can have more relaxed conditions, unlike tortoise hare can't vote against, so it is the same as not voting at all. i think we will remove IsSynced condition all together, and rely on specific data being available.

but what is more important is that:

  • sync can't stop consensus (maybe what i changed is enough, but i would remove this completely and turn off consensus only if i have 0 peers for longer than 20-30 minutes)
  • additionally there is a persistent issue about selecting responsive peers, which is also quite important

@dshulyak dshulyak self-assigned this Oct 6, 2023
@dshulyak
Copy link
Contributor

dshulyak commented Oct 6, 2023

likely same problem as #4977

bors bot pushed a commit that referenced this issue Oct 13, 2023
closes: #5127 #5036

peers that are overwhelmed or generally will not be used for requests. there are two criteria used to select good peer:
- request success rate . success rates within 0.1 (10%) of each other are treated as equal, and in such case we will use latency
- latency. hs/1 protocol used to track latency, as it is the most used protocol and objects served in this protocol are of the same size with several exceptions (active sets, list of malfeasence proofs).

related: #4977

limits number of peers to request data for atxs. previously we were requesting data from all peers atleast once.

synced data 2 times in 90m, previous attempt on my computer was 1 week ago and took 12h
bors bot pushed a commit that referenced this issue Oct 13, 2023
closes: #5127 #5036

peers that are overwhelmed or generally will not be used for requests. there are two criteria used to select good peer:
- request success rate . success rates within 0.1 (10%) of each other are treated as equal, and in such case we will use latency
- latency. hs/1 protocol used to track latency, as it is the most used protocol and objects served in this protocol are of the same size with several exceptions (active sets, list of malfeasence proofs).

related: #4977

limits number of peers to request data for atxs. previously we were requesting data from all peers atleast once.

synced data 2 times in 90m, previous attempt on my computer was 1 week ago and took 12h
@github-project-automation github-project-automation bot moved this from 🔖 Next to ✅ Done in Dev team kanban Oct 13, 2023
dshulyak added a commit to dshulyak/go-spacemesh that referenced this issue Oct 13, 2023
…emeshos#5143)

closes: spacemeshos#5127 spacemeshos#5036

peers that are overwhelmed or generally will not be used for requests. there are two criteria used to select good peer:
- request success rate . success rates within 0.1 (10%) of each other are treated as equal, and in such case we will use latency
- latency. hs/1 protocol used to track latency, as it is the most used protocol and objects served in this protocol are of the same size with several exceptions (active sets, list of malfeasence proofs).

related: spacemeshos#4977

limits number of peers to request data for atxs. previously we were requesting data from all peers atleast once.

synced data 2 times in 90m, previous attempt on my computer was 1 week ago and took 12h
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

No branches or pull requests

2 participants