mitigating problems with the network due to poor atxs propagation and bloated active sets #5366

Closed
3 tasks
dshulyak opened this issue Dec 18, 2023 · 0 comments

Comments

@dshulyak
Contributor

there are two problems contributing to the extreme bandwidth usage. both of them are the result of abnormal growth in the number of atxs.

the first problem is caused by the process called atx sync. this process queries all known atxs in the current/next epoch and then fetches those that are missing from 20 peers (configurable). this process is neither scalable nor effective, as it doesn't guarantee that a node will download all missing atxs. a node can still miss them due to concurrency and data availability.

the second problem is the result of poor atx propagation and of atx sync not doing its job. as a result many nodes are using slightly different active sets. the active set is referenced in the first ballot that a node produces. when a node receives a proposal that includes such a first ballot, it will ask for the full active set if it sees that the referenced active set is not stored locally. this is also the reason why people are missing rewards.
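
a minimal sketch of why a one-atx difference forces a full download. the types `ATXID`, `Hash` and `store` are hypothetical stand-ins, and hashing the sorted ids with sha256 is an assumption for illustration only; the actual hash and encoding in the codebase may differ.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

// ATXID is a hypothetical stand-in for a 32-byte atx identifier.
type ATXID [32]byte

// Hash is a hypothetical stand-in for the value a first ballot uses
// to reference an active set.
type Hash [32]byte

// activeSetHash hashes the ids in sorted order, so the result does not
// depend on the order in which a node learned about the atxs.
// Assumption: sha256 over raw ids, chosen only for this sketch.
func activeSetHash(ids []ATXID) Hash {
	sorted := append([]ATXID(nil), ids...)
	sort.Slice(sorted, func(i, j int) bool {
		return bytes.Compare(sorted[i][:], sorted[j][:]) < 0
	})
	h := sha256.New()
	for _, id := range sorted {
		h.Write(id[:])
	}
	var out Hash
	copy(out[:], h.Sum(nil))
	return out
}

// store is a hypothetical local table of active sets keyed by hash.
type store map[Hash][]ATXID

// needsFetch reports whether the referenced active set is unknown
// locally, in which case the full set has to be requested from a peer.
func (s store) needsFetch(ref Hash) bool {
	_, ok := s[ref]
	return !ok
}

func main() {
	a, b, c := ATXID{1}, ATXID{2}, ATXID{3}
	local := []ATXID{a, b}
	remote := []ATXID{a, b, c} // the sender saw one more atx

	s := store{activeSetHash(local): local}
	fmt.Println("same set needs fetch:", s.needsFetch(activeSetHash(local)))
	fmt.Println("set with one extra atx needs fetch:", s.needsFetch(activeSetHash(remote)))
}
```

since the reference is a hash over the whole set, two sets that differ in a single atx look entirely unrelated to the receiver, and it has no choice but to request the full set.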

short term mitigation

we want to achieve two things:

  • download all on-time atxs without wasting a lot of traffic
  • guarantee that most nodes are using the same active set

we can achieve both of those goals by sharing the active set from our server. this mechanism was introduced in the codebase to mitigate certain failures in the hare oracle, and can be reused to compute the tortoise active set.

  • this code needs to be extended with a shared activeset, and the shared activeset should be saved into the table with active sets.
  • the second change is to disable atx sync and instead ask peers for the atxs in the shared activeset that are missing locally.
  • together with a beacon someone will need to generate an activeset as well, and this data will have to be downloaded by all nodes in the network.
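
the second item above can be sketched roughly as follows. this is an illustration under assumptions, not the actual implementation: `ATXID`, `missingFrom` and `batches` are hypothetical names, and the batch size is arbitrary.

```go
package main

import "fmt"

// ATXID is a hypothetical stand-in for a 32-byte atx identifier.
type ATXID [32]byte

// missingFrom returns the ids from the shared active set that are not
// stored locally, preserving the order of the shared set.
func missingFrom(shared []ATXID, local map[ATXID]struct{}) []ATXID {
	var missing []ATXID
	for _, id := range shared {
		if _, ok := local[id]; !ok {
			missing = append(missing, id)
		}
	}
	return missing
}

// batches splits ids into chunks of at most size elements, so that
// each request to a peer stays bounded.
func batches(ids []ATXID, size int) [][]ATXID {
	if size <= 0 {
		return nil
	}
	var out [][]ATXID
	for len(ids) > 0 {
		n := size
		if len(ids) < n {
			n = len(ids)
		}
		out = append(out, ids[:n])
		ids = ids[n:]
	}
	return out
}

func main() {
	shared := []ATXID{{1}, {2}, {3}, {4}, {5}}
	local := map[ATXID]struct{}{{2}: {}, {4}: {}}

	missing := missingFrom(shared, local)
	for i, b := range batches(missing, 2) {
		// a real node would send each batch to a peer here
		fmt.Printf("request %d: %d atxs\n", i, len(b))
	}
}
```

the point of the design is that the node only ever asks for ids it knows it lacks, instead of reconciling its whole epoch view with many peers.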

these changes will eliminate large requests during the consensus phase, and by my estimate will solve most existing issues with proposals getting lost.

long term mitigation

long term we need to get rid of centralized infrastructure. to achieve that, atx propagation has to be reliable, such that any honest node is able to download an atx within 30m (this is configurable).

sync changes for that include better atx sync protocol #5306 and replacing collections request with streams, so that limits are not hard-coded in codec specification #5278.

activeset should be encoded with a diff, rather than re-encoding the whole activeset #5282. additionally we may consider building the active set deterministically from one or several blocks #5288 (this is a change that can be finished to work properly).
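
a rough sketch of the diff idea, assuming the diff is taken against a base set that both sides already have. the `Diff` type and the `MakeDiff` / `Apply` helpers are hypothetical and are not taken from #5282.

```go
package main

import "fmt"

// ATXID is a hypothetical stand-in for a 32-byte atx identifier.
type ATXID [32]byte

// Diff describes an active set relative to a base set known to both sides.
type Diff struct {
	Added   []ATXID
	Removed []ATXID
}

func toSet(ids []ATXID) map[ATXID]struct{} {
	set := make(map[ATXID]struct{}, len(ids))
	for _, id := range ids {
		set[id] = struct{}{}
	}
	return set
}

// MakeDiff computes what has to change in base to obtain target.
func MakeDiff(base, target []ATXID) Diff {
	b, t := toSet(base), toSet(target)
	var d Diff
	for _, id := range target {
		if _, ok := b[id]; !ok {
			d.Added = append(d.Added, id)
		}
	}
	for _, id := range base {
		if _, ok := t[id]; !ok {
			d.Removed = append(d.Removed, id)
		}
	}
	return d
}

// Apply reconstructs the target set from base and the diff.
// Kept base entries come first, followed by the added ones.
func Apply(base []ATXID, d Diff) []ATXID {
	removed := toSet(d.Removed)
	out := make([]ATXID, 0, len(base)+len(d.Added))
	for _, id := range base {
		if _, ok := removed[id]; !ok {
			out = append(out, id)
		}
	}
	return append(out, d.Added...)
}

func main() {
	base := []ATXID{{1}, {2}, {3}}
	target := []ATXID{{1}, {3}, {4}}

	d := MakeDiff(base, target)
	fmt.Printf("diff carries %d ids, full set carries %d\n",
		len(d.Added)+len(d.Removed), len(target))
	fmt.Printf("reconstructed %d ids\n", len(Apply(base, d)))
}
```

in this toy case the saving is small, but when active sets differ by a handful of atxs out of a very large set, the diff is tiny compared to the full list.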

besides sync and activeset problems there are still problems with atx propagation. improvements in that area include:

  1. distributed post verification (post distributed verification #5185). will improve download speed a lot
  2. merging atxs (Support combining identities pm#267). likely won't help that much, but will help to keep the number of proposals in check
  3. distributing atxs over whole epoch

in my opinion the sync changes outlined above, deterministically generating the activeset, and atx changes 1 and 2 are the most important.

spacemesh-bors bot pushed a commit that referenced this issue Dec 21, 2023
related: #5366

this change disables sync reconciliation at the end and start of epoch, normal atx downloading works the way it was working.
instead it will ask peers for atxs that are recorded in the set, shared by the server.
spacemesh-bors bot pushed a commit that referenced this issue Dec 21, 2023
…5378)

## Motivation
related to #5366 

Allow to set a fallback active set to be used by the proposal builder via the same mechanism as we use for the fallback beacon.

## Changes
- some minor fixes in `bootstrap` cmd code
- extend `ProposalBuilder` to be able to set a specific active set for an epoch
- extend node routine listening for updates to fallback beacon / activeset to propagate a fallback to the `ProposalBuilder`

## Test Plan
- added test case to `ProposalBuilder` to verify use of fallback active set if available
- manual testing on `testnet-10`

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [x] Update documentation as needed
- [x] Update [changelog](../CHANGELOG.md) as needed
@dshulyak dshulyak closed this as completed Mar 1, 2024