optimize atx sync codepath #4977

Closed
Tracked by #279
dshulyak opened this issue Sep 7, 2023 · 3 comments
@dshulyak
Contributor

dshulyak commented Sep 7, 2023

atx sync requests and downloads all activation ids that are known to the node.

the main problem is that it scans the database on every request, which should be mitigated by implementing smarter caching (#4164). such a cache should keep all activations for the epoch and not evict them with an lru strategy.
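A minimal sketch of the caching idea described above, assuming simplified stand-in types. `ATXID`, `EpochID` and `EpochCache` are hypothetical names for illustration and are not taken from go-spacemesh or from #4164. The point is that entries are grouped by epoch and only dropped a whole epoch at a time, so a sync request can be answered from memory:

```go
package main

import (
	"fmt"
	"sync"
)

// ATXID and EpochID are simplified stand-ins (assumptions), not the node's real types.
type ATXID [32]byte
type EpochID uint32

// EpochCache keeps every known activation id grouped by epoch.
// There is no per-entry eviction: an epoch is either fully cached or dropped.
type EpochCache struct {
	mu     sync.RWMutex
	epochs map[EpochID]map[ATXID]struct{}
}

func NewEpochCache() *EpochCache {
	return &EpochCache{epochs: make(map[EpochID]map[ATXID]struct{})}
}

// Add records an activation id for the epoch. Adding the same id twice is a no-op.
func (c *EpochCache) Add(epoch EpochID, id ATXID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	set, ok := c.epochs[epoch]
	if !ok {
		set = make(map[ATXID]struct{})
		c.epochs[epoch] = set
	}
	set[id] = struct{}{}
}

// IDs returns a copy of all ids cached for the epoch, without a database scan.
func (c *EpochCache) IDs(epoch EpochID) []ATXID {
	c.mu.RLock()
	defer c.mu.RUnlock()
	ids := make([]ATXID, 0, len(c.epochs[epoch]))
	for id := range c.epochs[epoch] {
		ids = append(ids, id)
	}
	return ids
}

// Evict drops every epoch older than the given one.
func (c *EpochCache) Evict(before EpochID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for epoch := range c.epochs {
		if epoch < before {
			delete(c.epochs, epoch)
		}
	}
}

func main() {
	cache := NewEpochCache()
	cache.Add(5, ATXID{1})
	cache.Add(5, ATXID{2})
	cache.Add(6, ATXID{3})
	fmt.Println(len(cache.IDs(5)), len(cache.IDs(6))) // 2 1
	cache.Evict(6)
	fmt.Println(len(cache.IDs(5)), len(cache.IDs(6))) // 0 1
}
```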

a secondary problem is the amount of traffic that it adds, which is less straightforward to solve. maybe we should consider dropping atx sync and instead have every node regossip its own atx every 30m.

ideally we should implement this before atx sync starts in the next epoch.

@github-project-automation github-project-automation bot moved this to 📋 Backlog in Dev team kanban Sep 7, 2023
@dshulyak dshulyak moved this from 📋 Backlog to 🔖 Next in Dev team kanban Sep 8, 2023
@dshulyak
Contributor Author

i will solve this by using an in-memory cache with all atxs #5013

@dshulyak dshulyak self-assigned this Sep 20, 2023
@dshulyak dshulyak moved this from 🔖 Next to 🏗 Doing in Dev team kanban Sep 22, 2023
@dshulyak dshulyak moved this from 🏗 Doing to 🔖 Next in Dev team kanban Sep 22, 2023
@dshulyak dshulyak moved this from 🔖 Next to 🏗 Doing in Dev team kanban Sep 25, 2023
@dshulyak
Contributor Author

i will try to disable atx sync in the next epoch and use only atx regossiping

@dshulyak dshulyak moved this from 🏗 Doing to 📋 Backlog in Dev team kanban Oct 3, 2023
@dshulyak dshulyak moved this from 📋 Backlog to 🔖 Next in Dev team kanban Oct 6, 2023
@dshulyak
Contributor Author

dshulyak commented Oct 6, 2023

  • add a configurable rate limiter to every p2p rpc
  • prioritize responsive peers for queries

@dshulyak dshulyak moved this from 🔖 Next to 🏗 Doing in Dev team kanban Oct 10, 2023
bors bot pushed a commit that referenced this issue Oct 13, 2023
closes: #5127 #5036

peers that are overwhelmed or generally unresponsive will not be used for requests. there are two criteria used to select a good peer:
- request success rate. success rates within 0.1 (10%) of each other are treated as equal, in which case latency is used
- latency. the hs/1 protocol is used to track latency, as it is the most used protocol and the objects served by it are of the same size, with several exceptions (active sets, lists of malfeasance proofs).
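
The two criteria above can be sketched as a comparison function. This is an illustration under assumptions, not the code from the referenced commit: `peerStats`, `better` and `pickBest` are hypothetical names, and treating a peer with no recorded requests as having a success rate of 0 is an arbitrary choice made for the sketch.

```go
package main

import (
	"fmt"
	"time"
)

// peerStats is a hypothetical per-peer summary (assumption, not the real type).
type peerStats struct {
	id        string
	successes int
	failures  int
	latency   time.Duration // e.g. averaged over hs/1 requests
}

// rateTolerance: success rates closer than this are treated as equal.
const rateTolerance = 0.1

func (p peerStats) successRate() float64 {
	total := p.successes + p.failures
	if total == 0 {
		return 0 // assumption: unknown peers rank last on success rate
	}
	return float64(p.successes) / float64(total)
}

// better reports whether a should be preferred over b:
// first by success rate, and by latency when the rates are within tolerance.
func better(a, b peerStats) bool {
	diff := a.successRate() - b.successRate()
	switch {
	case diff > rateTolerance:
		return true
	case diff < -rateTolerance:
		return false
	default:
		return a.latency < b.latency
	}
}

// pickBest scans the peers once and returns the preferred one.
func pickBest(peers []peerStats) (peerStats, bool) {
	if len(peers) == 0 {
		return peerStats{}, false
	}
	best := peers[0]
	for _, p := range peers[1:] {
		if better(p, best) {
			best = p
		}
	}
	return best, true
}

func main() {
	peers := []peerStats{
		{id: "a", successes: 90, failures: 10, latency: 200 * time.Millisecond},
		{id: "b", successes: 85, failures: 15, latency: 50 * time.Millisecond},
		{id: "c", successes: 50, failures: 50, latency: 10 * time.Millisecond},
	}
	best, _ := pickBest(peers)
	fmt.Println(best.id) // b: rate is within 0.1 of a, and latency is lower
}
```

Because rates within the tolerance count as equal, this comparison is not transitive, so the sketch uses a linear scan rather than a sort.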

related: #4977

limits the number of peers to request atx data from. previously we were requesting data from all peers at least once.

synced data 2 times in 90m, previous attempt on my computer was 1 week ago and took 12h
dshulyak added a commit to dshulyak/go-spacemesh that referenced this issue Oct 13, 2023
…emeshos#5143)

bors bot pushed a commit that referenced this issue Oct 20, 2023
closes: #4977
closes: #4603

this change introduces two configuration parameters for every server:
- requests per interval pace, for example 10 req/s. this caps the maximum bandwidth that every server can use
- queue size. it is set so that requests are served within the expected latency. every other request is dropped immediately so that the client can retry with a different node. currently the timeout is set to 10s, so the queue should be roughly 10 times larger than rps
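
A minimal sketch of how the two parameters interact, assuming a bounded queue drained at a fixed pace. `limitedServer`, `Submit`, `Run` and `ErrQueueFull` are hypothetical names, and the ticker-based pacing is a simplification chosen for the example; it is not claimed to be how the commit implements the limiter.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrQueueFull tells the client to retry with a different node.
var ErrQueueFull = errors.New("queue is full")

// limitedServer is a hypothetical server with a bounded queue and a fixed pace.
type limitedServer struct {
	queue chan func()
	pace  time.Duration // time between served requests: interval / requests
}

// newLimitedServer assumes requests > 0 and interval >= requests nanoseconds.
func newLimitedServer(queueSize, requests int, interval time.Duration) *limitedServer {
	return &limitedServer{
		queue: make(chan func(), queueSize),
		pace:  interval / time.Duration(requests),
	}
}

// Submit queues the handler, or rejects it immediately when the queue is full.
func (s *limitedServer) Submit(handler func()) error {
	select {
	case s.queue <- handler:
		return nil
	default:
		return ErrQueueFull
	}
}

// Run serves at most one queued handler per tick until ctx is cancelled.
func (s *limitedServer) Run(ctx context.Context) {
	ticker := time.NewTicker(s.pace)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			select {
			case handler := <-s.queue:
				handler()
			default: // nothing queued on this tick
			}
		}
	}
}

func main() {
	srv := newLimitedServer(2, 100, time.Second) // queue of 2, 100 req/s
	served := make(chan int, 2)
	for i := 0; i < 3; i++ {
		i := i
		err := srv.Submit(func() { served <- i })
		fmt.Println("request", i, "accepted:", err == nil)
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go srv.Run(ctx)
	fmt.Println("served", <-served, <-served)
}
```

The third request is rejected right away instead of waiting, which is the behavior that lets a client move on to another node.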

it doesn't provide a global limit for bandwidth, but we have a limit on the number of peers, and an honest peer doesn't run many concurrent queries. so what we really want to handle is peers with intentionally malicious behavior, but that's not a pressing issue

example configuration:

```json
"fetch": {
        "servers": {
            "ax/1": {"queue": 10, "requests": 1, "interval": "1s"},
            "ld/1": {"queue": 1000, "requests": 100, "interval": "1s"},
            "hs/1": {"queue": 2000, "requests": 200, "interval": "1s"},
            "mh/1": {"queue": 1000, "requests": 100, "interval": "1s"},
            "ml/1": {"queue": 100, "requests": 10, "interval": "1s"},
            "lp/2": {"queue": 10000, "requests": 1000, "interval": "1s"}
        }
    }
```
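
As a rough check of the sizing rule (queue about 10 times rps for a 10s timeout), the sketch below parses values like the ones in the example and estimates how long the last request in a full queue would wait. The struct layout and the names `serverConfig`, `parseServers` and `maxWait` are assumptions for illustration; the node's real config types may differ, and the fragment is wrapped in braces here to make it valid JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// serverConfig mirrors one entry of the example (hypothetical type).
type serverConfig struct {
	Queue    int    `json:"queue"`
	Requests int    `json:"requests"`
	Interval string `json:"interval"`
}

// parseServers extracts the "fetch.servers" map from a JSON document.
func parseServers(raw []byte) (map[string]serverConfig, error) {
	var cfg struct {
		Fetch struct {
			Servers map[string]serverConfig `json:"servers"`
		} `json:"fetch"`
	}
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	return cfg.Fetch.Servers, nil
}

// maxWait estimates the wait of the last request in a full queue:
// queue size divided by the request rate.
func maxWait(cfg serverConfig) (time.Duration, error) {
	interval, err := time.ParseDuration(cfg.Interval)
	if err != nil {
		return 0, err
	}
	if cfg.Requests <= 0 {
		return 0, fmt.Errorf("requests must be positive, got %d", cfg.Requests)
	}
	return interval * time.Duration(cfg.Queue) / time.Duration(cfg.Requests), nil
}

func main() {
	raw := []byte(`{"fetch": {"servers": {
		"ax/1": {"queue": 10, "requests": 1, "interval": "1s"},
		"hs/1": {"queue": 2000, "requests": 200, "interval": "1s"}
	}}}`)
	servers, err := parseServers(raw)
	if err != nil {
		panic(err)
	}
	for _, name := range []string{"ax/1", "hs/1"} {
		wait, err := maxWait(servers[name])
		if err != nil {
			panic(err)
		}
		fmt.Println(name, wait) // both print 10s
	}
}
```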

https://github.com/spacemeshos/go-spacemesh/blob/3cf02146bf27f53c001bffcacffbda05933c27c4/fetch/fetch.go#L130-L144


metrics are per server:

https://github.com/spacemeshos/go-spacemesh/blob/3cf02146bf27f53c001bffcacffbda05933c27c4/p2p/server/metrics.go#L15-L52

metrics have to be enabled for all servers with

```json
"fetch": {
        "servers-metrics": true
    }
```
bors bot pushed a commit that referenced this issue Oct 21, 2023
bors bot pushed a commit that referenced this issue Oct 22, 2023
@bors bors bot closed this as completed in adb2849 Oct 22, 2023
@github-project-automation github-project-automation bot moved this from 🏗 Doing to ✅ Done in Dev team kanban Oct 22, 2023