Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prune BlockRef on finalization #3513

Merged
merged 5 commits into from
Mar 17, 2022
Merged

Prune BlockRef on finalization #3513

merged 5 commits into from
Mar 17, 2022

Conversation

arnetheduck
Copy link
Member

Up til now, the block dag has been using BlockRef, a structure adapted
for a full DAG, to represent all of chain history. This is a correct and
simple design, but does not exploit the linearity of the chain once
parts of it finalize.

By pruning the in-memory BlockRef structure at finalization, we save,
at the time of writing, a cool ~250mb (or 25%:ish) chunk of memory
landing us at a steady state of ~750mb normal memory usage for a
validating node.

Above all though, we prevent memory usage from growing proportionally
with the length of the chain, something that would not be sustainable
over time - instead, the steady state memory usage is roughly
determined by the validator set size which grows much more slowly. With
these changes, the core should remain sustainable memory-wise post-merge
all the way to withdrawals (when the validator set is expected to grow).

In-memory indices are still used for the "hot" unfinalized portion of
the chain - this ensure that consensus performance remains unchanged.

What changes is that for historical access, we use a db-based linear
slot index which is cache-and-disk-friendly, keeping the cost for
accessing historical data at a similar level as before, achieving the
savings at no percievable cost to functionality or performance.

A nice collateral benefit is the almost-instant startup since we no
longer load any large indicies at dag init.

The cost of this functionality instead can be found in the complexity of
having to deal with two ways of traversing the chain - by BlockRef and
by slot.

  • use BlockId instead of BlockRef where finalized / historical data
    may be required
  • simplify clearance pre-advancement
  • remove dag.finalizedBlocks (~50:ish mb)
  • remove getBlockAtSlot - use getBlockIdAtSlot instead
  • parent and atSlot for BlockId now require a ChainDAGRef
    instance, unlike BlockRef traversal
  • prune BlockRef parents on finality (~200:ish mb)
  • speed up ChainDAG init by not loading finalized history index
  • mess up light client server error handling - this need revisiting :)

Screenshot from 2022-03-11 17-34-08

Up til now, the block dag has been using `BlockRef`, a structure adapted
for a full DAG, to represent all of chain history. This is a correct and
simple design, but does not exploit the linearity of the chain once
parts of it finalize.

By pruning the in-memory `BlockRef` structure at finalization, we save,
at the time of writing, a cool ~250mb (or 25%:ish) chunk of memory
landing us at a steady state of ~750mb normal memory usage for a
validating node.

Above all though, we prevent memory usage from growing proportionally
with the length of the chain, something that would not be sustainable
over time -  instead, the steady state memory usage is roughly
determined by the validator set size which grows much more slowly. With
these changes, the core should remain sustainable memory-wise post-merge
all the way to withdrawals (when the validator set is expected to grow).

In-memory indices are still used for the "hot" unfinalized portion of
the chain - this ensure that consensus performance remains unchanged.

What changes is that for historical access, we use a db-based linear
slot index which is cache-and-disk-friendly, keeping the cost for
accessing historical data at a similar level as before, achieving the
savings at no percievable cost to functionality or performance.

A nice collateral benefit is the almost-instant startup since we no
longer load any large indicies at dag init.

The cost of this functionality instead can be found in the complexity of
having to deal with two ways of traversing the chain - by `BlockRef` and
by slot.

* use `BlockId` instead of `BlockRef` where finalized / historical data
may be required
* simplify clearance pre-advancement
* remove dag.finalizedBlocks (~50:ish mb)
* remove `getBlockAtSlot` - use `getBlockIdAtSlot` instead
* `parent` and `atSlot` for `BlockId` now require a `ChainDAGRef`
instance, unlike `BlockRef` traversal
* prune `BlockRef` parents on finality (~200:ish mb)
* speed up ChainDAG init by not loading finalized history index
* mess up light client server error handling - this need revisiting :)
if unsafeAddr(state) == unsafeAddr(dag.headState):
unsafeAddr dag.clearanceState
else:
unsafeAddr dag.headState
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may introduce weird bugs in a couple months if we ever introduce "ForkedBeaconState" because a hard fork requires us to.

Though at that point we will likely refactor all and see that unsafe use without going through heisenbug head-scratching hunting process.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what you mean - both of these are ForkedHashedBeaconState objects - all this function needs to do is return them to any valid / consistent beacon state - on error, SSZ will leave a partially initialized state and we want to avoid letting those partials live in memory

@github-actions
Copy link

github-actions bot commented Mar 17, 2022

Unit Test Results

     12 files     830 suites   35m 4s ⏱️
1 680 tests 1 631 ✔️   49 💤 0
9 801 runs  9 685 ✔️ 116 💤 0

Results for commit 38cfc2e.

♻️ This comment has been updated with latest results.

Copy link
Contributor

@etan-status etan-status left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did one review pass through the changes. Will take a more through look through the light client changes once this PR has been merged.

beacon_chain/consensus_object_pools/blockchain_dag.nim Outdated Show resolved Hide resolved
func getBlockIdAtSlot*(dag: ChainDAGRef, slot: Slot): Opt[BlockSlotId] =
## Retrieve the canonical block at the given slot, or the last block that
## comes before - similar to atSlot, but without the linear scan - may hit
## the database to look up early indices.
if slot == dag.genesis.slot:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe an early reject for slot < dag.genesis.slot. existing logic should handle it fine though (if it's even possible. lots of code just assumes GENESIS_SLOT == 0 and compares to 0 instead of the constant anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very early versions of the spec had GENESIS_SLOT > 0, so having these checks against non-zero is more of a path-dependent outcome than any explicit desire to support non-zero genesis - there's a mix of 0, GENESIS_SLOT and other ways of checking genesis indeed and maybe it would make sense to clean that up into something more coherent but that's a story for a different PR

Comment on lines 289 to 291
let existing = ? dag.getBlockIdAtSlot(bid.slot)
if existing.bid != bid:
return err() # Not part of known / relevant history
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If bid.slot > dag.finalizedHead.slot, I am not sure if this check should be enforced. When bid is from a non-finalized fork that is not currently selected by fork-choice, dag.getBlockIdAtSlot for > dag.finalizedHead.slot could return a different bid, even though the fork may become relevant again later?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, 38cfc2e

EpochKey() # The searched-for epoch predates our tail block
else:
EpochKey(epoch: epoch, blck: blck)
let bsi = dag.atSlot(bid, epoch.start_slot - 1).valueOr:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially underflow when passing epoch 0 with a bid from a later epoch. Not sure if reachable by callers (function is exported)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise, 38cfc2e

let
headRoot = ? db.getHeadBlock()

for blck in db.getAncestorSummaries(headRoot):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: This returns RootedSummary, blck is a bit misleading varname here.

## Determine the earliest block with most sync committee signatures among
## ancestors of `highBlck` with at least `lowSlot` as parent block slot.
## Return `nil` if no block with `MIN_SYNC_COMMITTEE_PARTICIPANTS` is found.
var
maxParticipants = 0
maxBlockRef: BlockRef
maxBlockRef: Opt[BlockId]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are now bid no longer blockRef (varnames)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I left most renames for another day, there's enough noise in here already

beacon_chain/rpc/rest_utils.nim Outdated Show resolved Hide resolved
beacon_chain/rpc/rest_utils.nim Outdated Show resolved Hide resolved
let bsid = dag.getBlockIdAtSlot(current).valueOr:
continue

result.add(bsid.bid)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if bsid.bid.slot >= start, no? If there are empty slot it could already be out of range.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arnetheduck arnetheduck enabled auto-merge (squash) March 17, 2022 15:38
@arnetheduck arnetheduck merged commit 05ffe7b into unstable Mar 17, 2022
@arnetheduck arnetheduck deleted the prune-blockref branch March 17, 2022 17:42
@arnetheduck arnetheduck restored the prune-blockref branch March 17, 2022 17:46
@arnetheduck arnetheduck deleted the prune-blockref branch March 17, 2022 17:46
etan-status added a commit that referenced this pull request Mar 17, 2022
Gracefully handles the new failure modes recently introduced to the DAG
as part of #3513
Data that is deemed to exist but fails to load leads to an error log to
avoid suppressing logic errors accidentally. In `verifyFinalization`
mode, the assertions remain active.
etan-status added a commit that referenced this pull request Mar 17, 2022
Gracefully handles the new failure modes recently introduced to the DAG
as part of #3513
Data that is deemed to exist but fails to load leads to an error log to
avoid suppressing logic errors accidentally. In `verifyFinalization`
mode, the assertions remain active.
arnetheduck added a commit that referenced this pull request Mar 18, 2022
regression from #3513 that did not take tail into consideration when
loading epoch ancestor
arnetheduck added a commit that referenced this pull request Mar 18, 2022
regression from #3513 that did not take tail into consideration when
loading epoch ancestor
etan-status added a commit that referenced this pull request Mar 18, 2022
Gracefully handles the new failure modes recently introduced to the DAG
as part of #3513
Data that is deemed to exist but fails to load leads to an error log to
avoid suppressing logic errors accidentally. In `verifyFinalization`
mode, the assertions remain active.
etan-status added a commit that referenced this pull request Mar 18, 2022
Gracefully handles the new failure modes recently introduced to the DAG
as part of #3513
Data that is deemed to exist but fails to load leads to an error log to
avoid suppressing logic errors accidentally. In `verifyFinalization`
mode, the assertions remain active.
etan-status added a commit that referenced this pull request Mar 19, 2022
Gracefully handles the new failure modes recently introduced to the DAG
as part of #3513
Data that is deemed to exist but fails to load leads to an error log to
avoid suppressing logic errors accidentally. In `verifyFinalization`
mode, the assertions remain active.
etan-status added a commit that referenced this pull request Mar 20, 2022
Gracefully handles the new failure modes recently introduced to the DAG
as part of #3513
Data that is deemed to exist but fails to load leads to an error log to
avoid suppressing logic errors accidentally. In `verifyFinalization`
mode, the assertions remain active.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants