optimize atx sync codepath #4977
Comments
I will solve this by using an in-memory cache with all ATXs (#5013).
I will try to disable ATX sync in the next epoch and use only ATX regossiping.
This was referenced Oct 6, 2023
bors bot pushed a commit that referenced this issue on Oct 13, 2023:
closes: #5127 #5036

Avoid peers that are overwhelmed or that generally should not be used for requests. There are two criteria used to select a good peer:
- Request success rate. Success rates within 0.1 (10%) of each other are treated as equal; in that case latency is used instead.
- Latency. The hs/1 protocol is used to track latency, as it is the most used protocol and objects served over it are of the same size, with several exceptions (active sets, lists of malfeasance proofs).

related: #4977

Limits the number of peers that ATX data is requested from. Previously we requested data from all peers at least once. Synced data twice in 90 minutes; the previous attempt on my computer was a week ago and took 12 hours.
dshulyak added a commit to dshulyak/go-spacemesh that referenced this issue on Oct 13, 2023 (same commit message as above).
bors bot pushed a commit that referenced this issue on Oct 20, 2023:
closes: #4977 closes: #4603

This change introduces two configuration parameters for every server:
- A requests-per-interval pace (for example 10 req/s); this caps the maximum bandwidth each server can use.
- A queue size, set so that requests are served within the expected latency. Every other request is dropped immediately so that the client can retry with a different node. Currently the timeout is set to 10s, so the queue should be roughly 10 times larger than the rps.

It doesn't provide a global limit on bandwidth, but we have a limit on the number of peers, and an honest peer doesn't run many concurrent queries. What we really want to handle is peers with intentionally malicious behavior, but that is not a pressing issue.

Example configuration:

```json
"fetch": {
    "servers": {
        "ax/1": {"queue": 10, "requests": 1, "interval": "1s"},
        "ld/1": {"queue": 1000, "requests": 100, "interval": "1s"},
        "hs/1": {"queue": 2000, "requests": 200, "interval": "1s"},
        "mh/1": {"queue": 1000, "requests": 100, "interval": "1s"},
        "ml/1": {"queue": 100, "requests": 10, "interval": "1s"},
        "lp/2": {"queue": 10000, "requests": 1000, "interval": "1s"}
    }
}
```

https://github.com/spacemeshos/go-spacemesh/blob/3cf02146bf27f53c001bffcacffbda05933c27c4/fetch/fetch.go#L130-L144

Metrics are per server:

https://github.com/spacemeshos/go-spacemesh/blob/3cf02146bf27f53c001bffcacffbda05933c27c4/p2p/server/metrics.go#L15-L52

They have to be enabled for all servers with:

```json
"fetch": {
    "servers-metrics": true
}
```
bors bot pushed the same commit again on Oct 20, Oct 21, and twice on Oct 22, 2023.
It requests and downloads all activation IDs that are known to the node.
The main problem is that it scans the database on every request, which should be mitigated by implementing smarter caching (#4164). Such caching should keep all activations for the epoch in cache, and not evict them on an LRU strategy.
A secondary problem is the amount of traffic that it adds; this is less straightforward to solve. Maybe we should consider dropping ATX sync and instead have everyone regossip its own ATX every 30m.
Ideally we should implement this before ATX sync starts in the next epoch.
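The caching idea above, keep every ATX of the current epoch in memory and evict whole epochs rather than individual entries via LRU, can be sketched as follows. This is an illustrative sketch under assumed types; `atxCache` and its methods are hypothetical names, not the go-spacemesh API.

```go
package main

import (
	"fmt"
	"sync"
)

type epoch uint32
type atxID string

// atxCache keeps every ATX ID seen for an epoch, instead of an LRU
// cache that could evict entries mid-epoch. Old epochs are dropped
// as a whole once they are no longer needed for sync.
type atxCache struct {
	mu     sync.RWMutex
	epochs map[epoch][]atxID
}

func newATXCache() *atxCache {
	return &atxCache{epochs: map[epoch][]atxID{}}
}

func (c *atxCache) Add(e epoch, id atxID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.epochs[e] = append(c.epochs[e], id)
}

// Get serves a sync request entirely from memory, avoiding the
// per-request database scan described above.
func (c *atxCache) Get(e epoch) []atxID {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.epochs[e]
}

// EvictBefore drops whole epochs older than e.
func (c *atxCache) EvictBefore(e epoch) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k := range c.epochs {
		if k < e {
			delete(c.epochs, k)
		}
	}
}

func main() {
	c := newATXCache()
	c.Add(7, "atx-a")
	c.Add(7, "atx-b")
	c.Add(6, "atx-old")
	c.EvictBefore(7)
	fmt.Println(len(c.Get(7)), len(c.Get(6)))
}
```

Epoch-granular eviction matches the access pattern: sync requests ask for the full set of an epoch's activations, so partially evicting an epoch would force a database scan anyway.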