
add processing and processed caching to the DA checker #4732

Merged

Conversation


@realbigsean realbigsean commented Sep 13, 2023

Issue Addressed

Resolves: #4667

Proposed Changes

===== See Below for Description Additions on 9/21 =====

This PR aims to fix raciness we're observing in single block/blob lookups by:

1. moving state related to delayed lookups out of the networking layer (no longer caching messages in the delay service) and only using the beacon chain's data availability checker

I experimented with moving the entire delay service into the beacon chain, but it didn't work well. The timing logic needs access to the beacon processor, and the beacon processor can't be added to the beacon chain without creating a circular dependency. More generally, "pushing" to the networking layer from the beacon chain isn't done anywhere else as far as I can tell, so it's at odds with how we've architected things. In this PR I'm therefore separating the timing logic from the state related to delayed lookups: the network now periodically polls the beacon chain for any block roots it still needs to make a lookup request for (a rough sketch of this polling pattern follows after this list).

2. expanding what is stored in the data availability checker to include a notion of "processing" and "processed" blocks and blobs.

I was initially thinking we could use the gossip seen caches for this, but this doesn't end up working too well for a couple reasons:

  1. If we're going to rely on this cache to know whether to retry a request, we need to clear "processing" messages on a processing failure.
  2. Presently, our blob seen caches are keyed on (index, block_root). Because there's no blob_root (i.e. no notion of blob uniqueness), the cache would cause us to filter out valid blobs if we receive an invalid blob first for the same (index, block_root). There's also no blob slashing condition. This is an open problem in gossip caches (see Blob spam protection mechanism ethereum/consensus-specs#3261). Because this is an open problem in gossip, we need to be resilient to it in single blob lookups, so we have to be able to request multiple (index, block_root) combinations via RPC. If we implement the mechanic from the linked issue, however, it could work:

clearing seen cache when blob is determined to be invalid post-gossip-validation, thus allowing other blobs for the same block/index to propagate

I talked about this with @ethDreamer though, and it seems like not propagating multiple blobs per (index, block_root) in gossip, while being able to handle it in RPC, is probably the way to go: it minimizes the chance of a malicious proposer's block being made canonical, while still letting us recover if it is made canonical.
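To illustrate point 1 above, here is a minimal, self-contained sketch of the polling pattern. The names `BeaconChainHandle`, `delayed_lookup_block_roots`, and `trigger_single_lookup` are illustrative stand-ins, not the actual Lighthouse API:

use std::time::Duration;

type Hash256 = [u8; 32];

// Stand-in for the real beacon chain handle owned by the network layer.
struct BeaconChainHandle;

impl BeaconChainHandle {
    // Roots the DA checker knows about but hasn't finished importing,
    // i.e. roots that may still need a single block/blob lookup.
    fn delayed_lookup_block_roots(&self) -> Vec<Hash256> {
        Vec::new()
    }
}

// In the real code this would go through sync to issue an RPC request.
fn trigger_single_lookup(_root: Hash256) {}

// The network side owns the timer and polls the chain, rather than the
// beacon chain pushing work into the networking layer.
async fn poll_delayed_lookups(chain: BeaconChainHandle) {
    let mut interval = tokio::time::interval(Duration::from_millis(500));
    loop {
        interval.tick().await;
        for root in chain.delayed_lookup_block_roots() {
            trigger_single_lookup(root);
        }
    }
}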

=============== Description Additions on 9/21 ==========

Why AvailabilityView?

We end up needing similar logic in a few places where we store parts of a block that we might not be sure are valid while waiting for the other parts. The AvailabilityView exists so we don't have to re-implement this logic in each cache and so it's easier to keep consistent across multiple caches. The logic as currently written is as follows:

  1. If we haven't seen a block yet, we will insert the first blob for a given (block_root, index) but we won't insert subsequent blobs for the same (block_root, index) if they have a different commitment.
  2. On block insertion, any non-matching blob commitments are evicted.

The intent here is to 1. give precedence to earlier messages, and 2. make sure we can recover from messages we eventually discover weren't valid.
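As a rough illustration of these two rules, here's a simplified, self-contained sketch (the types and names are illustrative, not the actual AvailabilityView trait in the PR):

use std::collections::HashMap;

type KzgCommitment = [u8; 48];

struct CachedBlob {
    index: usize,
    commitment: KzgCommitment,
}

#[derive(Default)]
struct Components {
    block_commitments: Option<Vec<KzgCommitment>>,
    blobs: HashMap<usize, CachedBlob>,
}

impl Components {
    // Rule 1: the first blob seen for an index wins; later blobs with a
    // different commitment are ignored. Once the block is known, only blobs
    // matching its commitment at that index are accepted.
    fn insert_blob(&mut self, blob: CachedBlob) {
        if let Some(commitments) = &self.block_commitments {
            if commitments.get(blob.index) != Some(&blob.commitment) {
                return;
            }
        }
        self.blobs.entry(blob.index).or_insert(blob);
    }

    // Rule 2: on block insertion its commitments are authoritative, so any
    // cached blob with a non-matching commitment is evicted.
    fn insert_block(&mut self, commitments: Vec<KzgCommitment>) {
        self.blobs
            .retain(|index, blob| commitments.get(*index) == Some(&blob.commitment));
        self.block_commitments = Some(commitments);
    }
}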

The places we use this type of logic:

  • NEW processing cache - caches knowledge of blocks/blobs from gossip verification or RPC download until we import or hit an error. Since adding delayed lookup logic, it has become important to know about blocks/blobs we've seen but haven't finished processing and haven't imported. This is similar in nature to the data availability cache but covers a longer window of processing and provides fewer validation assurances. It waits until import is complete to evict.
  • data availability cache - where we store fully verified components (including execution verification etc. for the block) until we are ready to import. We evict from this cache without waiting for import.
  • child components - when we trigger a single block/blob lookup via an UnknownParent, we don't want to keep processing the block/blob because it might be garbage. We need a place to put the block/blob; in stable lighthouse we just send these blocks to the network layer. In the deneb branch we continue this tradition, but it gets more complicated because we are only sending one piece at a time. Hence the introduction of ChildComponents (a rough sketch follows after this list).
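A rough sketch of the ChildComponents idea (illustrative, not the exact type in the PR): park whatever child pieces we've downloaded, keyed by the child's block root, until the parent chain has been processed.

type Hash256 = [u8; 32];

struct ChildComponents<Block, Blob> {
    block_root: Hash256,
    // The child block, if we've downloaded it.
    downloaded_block: Option<Block>,
    // One slot per blob index, filled in as blobs arrive from gossip or RPC.
    downloaded_blobs: Vec<Option<Blob>>,
}

impl<Block, Blob> ChildComponents<Block, Blob> {
    fn new(block_root: Hash256, max_blobs_per_block: usize) -> Self {
        Self {
            block_root,
            downloaded_block: None,
            downloaded_blobs: (0..max_blobs_per_block).map(|_| None).collect(),
        }
    }

    fn add_block(&mut self, block: Block) {
        self.downloaded_block = Some(block);
    }

    fn add_blob(&mut self, index: usize, blob: Blob) {
        if let Some(slot) = self.downloaded_blobs.get_mut(index) {
            // First blob seen for an index wins, matching the merge rules above.
            if slot.is_none() {
                *slot = Some(blob);
            }
        }
    }
}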

Why get rid of block/blob "consistency checks"?

I've been thinking about this a decent bit, and I think we were being too strict with our RPC validation. The consensus spec intentionally makes data availability checking very generic:

https://github.com/ethereum/consensus-specs/blob/dev/specs/deneb/fork-choice.md#is_data_available

Pretty much so long as you can KZG-verify blobs that match the block, the data is available. Other fields in the BlobSidecar like parent_root, slot, and proposer_index are necessary in gossip, but over RPC we really only care whether the data is correct. If, for example, a block is accepted as canonical but we miss it on gossip, later requesting it via RPC and failing because an inconsequential field was wrong, even though we have the correct data, seems unnecessarily strict.
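As a concrete (simplified) sketch of what the relaxed RPC-side check boils down to; `verify_kzg_proof` here is a placeholder for the real KZG library call, and the types are illustrative:

type KzgCommitment = [u8; 48];

struct RpcBlob {
    index: usize,
    commitment: KzgCommitment,
    // blob bytes + proof omitted for brevity
}

// Placeholder for the actual cryptographic verification.
fn verify_kzg_proof(_blob: &RpcBlob, _commitment: &KzgCommitment) -> bool {
    true
}

// Returns true if every blob matches a commitment in the block, which is
// essentially all that is_data_available fundamentally requires.
fn rpc_blobs_available(block_commitments: &[KzgCommitment], blobs: &[RpcBlob]) -> bool {
    blobs.iter().all(|blob| {
        block_commitments
            .get(blob.index)
            .map_or(false, |c| blob.commitment == *c && verify_kzg_proof(blob, c))
    })
}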

Bonus

Linking this for visibility but I think this piece of delay logic was bugged: #4732 (comment)

@realbigsean realbigsean added work-in-progress PR is a work-in-progress deneb labels Sep 13, 2023

@ethDreamer ethDreamer left a comment


review isn't done - but I had some suggestions

@@ -149,31 +145,12 @@ impl<T: BeaconChainTypes> DataAvailabilityChecker<T> {
pub fn get_missing_blob_ids(
Member

This function should be merged with get_missing_blob_ids_checking_cache since that's the only function that calls this. It would be clearer to have the behavior be:

None -> not sure which blobs we require
Some([]) -> No blobs are required
Some([a,b,c]) -> we know we need blobs [a,b,c]


@ethDreamer ethDreamer Sep 20, 2023


Although perhaps you still want a way to say "we don't know, but we don't have blobs [a,c]" if we have blobs [b,d].
In that case, you might try returning something like this:

enum MissingBlobResponse {
    KnownMissing(Vec<BlobIdentifier>), // We know for certain these blobs are missing.
    PossibleMissing(Vec<BlobIdentifier>), // We think these blobs might be missing.
    BlobsNotRequired, // Blobs are not required.
}

impl Into<Vec<BlobIdentifier>> for MissingBlobResponse {
    fn into(self) -> Vec<BlobIdentifier> {
        match self {
            MissingBlobResponse::KnownMissing(v) => v,
            MissingBlobResponse::PossibleMissing(v) => v,
            MissingBlobResponse::BlobsNotRequired => vec![],
        }
    }
}
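For illustration only (assuming the enum above is in scope and BlobIdentifier is the existing Lighthouse type), a caller that only needs a request list could flatten the response:

// Callers that just want "what should I request next" can flatten the
// response; BlobsNotRequired collapses to an empty list.
fn blob_ids_to_request(response: MissingBlobResponse) -> Vec<BlobIdentifier> {
    response.into()
}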

Member Author


This enum works well; I ended up using it in some of the single blob lookup code too. Right now, we have simple logic:

  • if we're prior to deneb, blobs aren't required
  • if we see a block, we know everything
  • if we don't see a block, we know nothing

but we could update this to

  • if we see a blob at index 4, we know all blobs with index < 4 must exist and blobs with index > 4 may exist

This enum doesn't give us both of these pieces at once, but if we end up wanting that info, I think it could be pretty easily extended.

Member

if we see a blob at index 4, we know all blobs with index < 4 must exist and blobs with index > 4 may exist

isn't this not necessarily true given that the proposer could give us a blob that's not in the block to fake us out?

Member Author


Yes, that's a good point, so we'd have to make sure that info was updated whenever we do get the block. It might not be worth implementing; I'm not sure it gives us much benefit over just querying for everything.

beacon_node/beacon_chain/src/data_availability_checker.rs (outdated, resolved)
@realbigsean realbigsean added the ready-for-review The code is ready for review label Sep 21, 2023

@michaelsproul michaelsproul left a comment


This looks solid from my read-through. I wasn't familiar with a lot of this code before, especially the Deneb networking stuff, so I feel like I haven't given this an incredibly thorough examination. Still, the core idea seems solid, and I appreciated the docs and tests for the new components like AvailabilityView.

This also reinforces my desire to increase the testing of our Deneb code through simulation & network testing. I feel like these caches and channels could easily hide subtle race conditions. When we have some time, let's hit them with Hydra.


@pawanjay176 pawanjay176 left a comment


This is looking quite good. The abstractions are really neat! Thanks for the detailed description of the AvailabilityView and the rationale, really helped in understanding the code.

Haven't completely finished the review, but here are some initial comments.


@pawanjay176 pawanjay176 left a comment


Great work on this. LGTM.

@realbigsean realbigsean merged commit c7ddf1f into sigp:deneb-free-blobs Oct 3, 2023
24 of 28 checks passed
@realbigsean realbigsean deleted the move-deneb-lookup-delay-logic branch November 21, 2023 16:15