Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep PendingComponents in da_checker during import_block #5845

Merged
merged 5 commits into from
Jun 3, 2024

Conversation

dapplion
Copy link
Collaborator

@dapplion dapplion commented May 24, 2024

Issue Addressed

Extends

Sync lookups relay on having an accurate picture of block and blobs processing status to perform well (not re-download, not get stuck)

Sync lookup makes the following assumption:

  • if block in da_checker = block is known and valid
  • else if block in processing_cache = block is known but not valid (yet)

These assumptions are incorrect. During block import, a block is no longer in the da_checker, but it's still in the processing_cache. Because we remove the block from the da_checker when the da_checker imports the last component of the block.

The current sequence of events is

- process_block_with_early_caching
  - ..                                            <<< insert into reqresp_pre_import_cache
  - process_block
    - check_block_availability_and_import
      - put_pending_executed_block
        - ..                                      <<< insert into da_checker
        - (if available) 
          - ..                                    >>> remove from da_checker
      - process_availability
        - import_available_block
          - import_block
            - fork_choice.on_block
  - ..                                            >>> remove from reqresp_pre_import_cache

We should delay removal from the da_checker until after completing import_block

Proposed Changes

Delay removal from the da_checker until after completing import_block. New sequence:

- process_block_with_early_caching
  - ..                                            <<< insert into reqresp_pre_import_cache
  - process_block
    - check_block_availability_and_import
      - put_pending_executed_block
        - ..                                      <<< insert into da_checker
        - (if available) 
      - process_availability
        - import_available_block
          - import_block
            - fork_choice.on_block
          - ..                                    >>> remove from da_checker
  - ..                                            >>> remove from reqresp_pre_import_cache

Ideally we should remove from da_checker after removing from reqresp_pre_import_cache. However, import_block is called from many places (see call graph below), so it would make the logic more convoluted.

import_block
- import_available_block
  - process_block
    - process_chain_segment
    - process_block_with_early_caching
  - process_availability
    - check_block_availability_and_import
      - process_block
    - check_gossip_blob_availability_and_import
    - check_rpc_blob_availability_and_import

@dapplion dapplion changed the title Time in da checker Keep PendingComponents in da_checker during import_block May 24, 2024
@michaelsproul michaelsproul added the v5.2.0 Q2 2024 label May 27, 2024
@realbigsean realbigsean added the ready-for-review The code is ready for review label May 27, 2024
Comment on lines +501 to +502
/// Removes and returns the pending_components corresponding to
/// the `block_root` or `None` if it does not exist
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Removes and returns the pending_components corresponding to
/// the `block_root` or `None` if it does not exist
/// Removes the `pending_components` corresponding to the `block_root`.

@dapplion
Copy link
Collaborator Author

@ethDreamer plz need help with the tests, quite lost on why they are broken. overflow_cache_test_maintenance appears to have random behaviour, with the cache having different sizes on each run and breaking the assertion cache disk should have the rest

Comment on lines +625 to +629
write_lock.put_pending_components(
block_root,
pending_components.clone(),
&self.overflow_store,
)?;
Copy link
Member

@jimmygchen jimmygchen May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're cloning pending components for a second time here and I'm not sure if it's necessary.

If the pending component is newly created in this function, it won't immediately become available. So this means the pending components would always be in the overflow_store if it becomes available here right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh never mind, we do need this as we're mutating a clone!

Copy link
Member

@ethDreamer ethDreamer May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to modify pop_pending_components to get_pending_components. Instead you should keep it as is. Because the write lock is held the whole time, there is no difference between:

lock = get_write_lock()
item = cache.remove()
item.mutate()
cache.insert(item)
drop(lock)

and

lock = get_write_lock()
item = cache.get()
item.mutate()
cache.update(item)
drop(lock)

You can just re-insert the item even when the components are complete (just like you did above) and then remove them with the same pop_pending_components method.

&mut self,
block_root: Hash256,
store: &OverflowStore<T>,
) -> Result<Option<PendingComponents<T::EthSpec>>, AvailabilityCheckError> {
match self.in_memory.pop_entry(&block_root) {
Some((_, pending_components)) => Ok(Some(pending_components)),
match self.in_memory.get(&block_root) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there might be an rare edge case scenario here where:

  1. we load the PendingComponents from memory
  2. we update it and try to put it back, however in_memory is at capacity and it goes to the disk.
  3. next time we want to update it, it loads from memory which would give us the outdated component

In this case I think it would just delay availability as we would do another lookup right? I think we might be able to avoid this altogether if we check if the components is already in memory before we attempt to use the overflow store?

Another one that's probably less of an issue:

  1. We load the PendingComponents, not found in memory, so it returned from store
  2. We update it and put it back to memory, now we have two versions one in memory and one on disk
  3. Now if the in memory one somehow gets pruned first, then we end up loading an old component from disk.

I think this is unlikely under normal network condition as the LRU cache's default capacity is 1024.

@jimmygchen
Copy link
Member

overflow_cache_test_maintenance appears to have random behaviour, with the cache having different sizes on each run and breaking the assertion cache disk should have the rest

I think this may be due to the randomness in the mock execution layer:

// get random number between 0 and Max Blobs
let mut rng = self.rng.lock();
let num_blobs = rng.gen::<usize>() % (E::max_blobs_per_block() + 1);

Not sure if there's any reason to use random number of blobs here - perhaps we can make this deterministic while covering all blob counts per block?

@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels May 29, 2024
michaelsproul added a commit that referenced this pull request May 30, 2024
Squashed commit of the following:

commit d75321b
Merge: 7c125b8 7f8b600
Author: realbigsean <[email protected]>
Date:   Mon May 27 19:52:41 2024 -0400

    Merge branch 'unstable' of https://github.com/sigp/lighthouse into time_in_da_checker

commit 7c125b8
Author: dapplion <[email protected]>
Date:   Sat May 25 01:32:02 2024 +0200

    Keep PendingComponents in da_checker during import_block

commit 7136412
Author: dapplion <[email protected]>
Date:   Sat May 25 01:01:23 2024 +0200

    Simplify BlockProcessStatus

commit dbcd7d1
Author: dapplion <[email protected]>
Date:   Fri May 24 20:00:05 2024 +0200

    Ensure lookup sync checks caches correctly
@michaelsproul michaelsproul mentioned this pull request May 30, 2024
@ethDreamer
Copy link
Member

Working on a PR to update the tests for the new logic. Will release shortly..

@ethDreamer
Copy link
Member

okay got the tests fixed:
dapplion#34

michaelsproul added a commit that referenced this pull request May 31, 2024
Squashed commit of the following:

commit d75321b
Merge: 7c125b8 7f8b600
Author: realbigsean <[email protected]>
Date:   Mon May 27 19:52:41 2024 -0400

    Merge branch 'unstable' of https://github.com/sigp/lighthouse into time_in_da_checker

commit 7c125b8
Author: dapplion <[email protected]>
Date:   Sat May 25 01:32:02 2024 +0200

    Keep PendingComponents in da_checker during import_block

commit 7136412
Author: dapplion <[email protected]>
Date:   Sat May 25 01:01:23 2024 +0200

    Simplify BlockProcessStatus

commit dbcd7d1
Author: dapplion <[email protected]>
Date:   Fri May 24 20:00:05 2024 +0200

    Ensure lookup sync checks caches correctly
michaelsproul added a commit that referenced this pull request Jun 3, 2024
Squashed commit of the following:

commit d75321b
Merge: 7c125b8 7f8b600
Author: realbigsean <[email protected]>
Date:   Mon May 27 19:52:41 2024 -0400

    Merge branch 'unstable' of https://github.com/sigp/lighthouse into time_in_da_checker

commit 7c125b8
Author: dapplion <[email protected]>
Date:   Sat May 25 01:32:02 2024 +0200

    Keep PendingComponents in da_checker during import_block

commit 7136412
Author: dapplion <[email protected]>
Date:   Sat May 25 01:01:23 2024 +0200

    Simplify BlockProcessStatus

commit dbcd7d1
Author: dapplion <[email protected]>
Date:   Fri May 24 20:00:05 2024 +0200

    Ensure lookup sync checks caches correctly
jimmygchen added a commit that referenced this pull request Jun 3, 2024
Squashed commit of the following:

commit c051218
Author: ethDreamer <[email protected]>
Date:   Fri May 31 13:38:19 2024 +0200

    Fix tests with DA checker new eviction policy (#34)

commit d75321b
Merge: 7c125b8 7f8b600
Author: realbigsean <[email protected]>
Date:   Mon May 27 19:52:41 2024 -0400

    Merge branch 'unstable' of https://github.com/sigp/lighthouse into time_in_da_checker

commit 7c125b8
Author: dapplion <[email protected]>
Date:   Sat May 25 01:32:02 2024 +0200

    Keep PendingComponents in da_checker during import_block

commit 7136412
Author: dapplion <[email protected]>
Date:   Sat May 25 01:01:23 2024 +0200

    Simplify BlockProcessStatus

commit dbcd7d1
Author: dapplion <[email protected]>
Date:   Fri May 24 20:00:05 2024 +0200

    Ensure lookup sync checks caches correctly
@realbigsean
Copy link
Member

@mergify queue

Copy link

mergify bot commented Jun 3, 2024

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at cb32807

@realbigsean realbigsean added ready-for-merge This PR is ready to merge. and removed waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels Jun 3, 2024
mergify bot added a commit that referenced this pull request Jun 3, 2024
@mergify mergify bot merged commit cb32807 into sigp:unstable Jun 3, 2024
28 checks passed
@dapplion dapplion deleted the time_in_da_checker branch June 5, 2024 11:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-merge This PR is ready to merge. v5.2.0 Q2 2024
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants