Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bron-Kerbosch attestation aggregation #4507

Open
wants to merge 41 commits into
base: unstable
Choose a base branch
from

Conversation

GeemoCandama
Copy link
Contributor

Issue Addressed

This branch will address #3733

Proposed Changes

Incorporate the Bron-Kerbosch algorithm that Satalia wrote into the attestation aggregation flow. More info in #3733

Additional Info

@ankurdubey521 and I will be working on this.

@paulhauner paulhauner added the work-in-progress PR is a work-in-progress label Jul 26, 2023
@GeemoCandama GeemoCandama force-pushed the bron_kerbosch_attestation_aggregation branch from 5329233 to f39becd Compare July 28, 2023 19:22
@GeemoCandama
Copy link
Contributor Author

I think it should be ready for a first pass.

Overview:

  • Deleted the importing agg from naive_aggregation_pool since the unagg attestations should already be in the op_pool.
  • During unagg attestation processing add them to the op_pool
  • Made Compact Attestation Data Clone (curious if this is the best solution here"
  • Altered fields of AttestationDataMap to a HashMap for aggregated_attestation and a HashMap for unaggregated_attestations.
  • Altered methods on AttestationDataMap accordingly
  • Removed greedy aggregation on insertion to the op_pool
  • Port Bron-Kerbosch implementation from Satalia to Lighthouse
  • Add get_clique_aggregate_attestations_for_epoch -> Vec<(&CompactAttestationData, CompactIndexedAttestation<T>)> method (This is where most of the changes are. I'm curious if it would be better to return an Iterator)
  • Use AttMaxCover of the output of the above as input to max_cover

Also the changes made a couple of previously used functions unused. Should I go ahead and delete them?

@GeemoCandama
Copy link
Contributor Author

GeemoCandama commented Aug 16, 2023

I think the general ideas are there however there are a few tests that stall forever that I'm working on:

test store_tests::delete_blocks_and_states has been running for over 60 seconds
test store_tests::prune_single_block_long_skip has been running for over 60 seconds
test store_tests::shuffling_compatible_simple_fork has been running for over 60 seconds
test store_tests::block_production_different_shuffling_early has been running for over 60 seconds
test store_tests::block_production_different_shuffling_long has been running for over 60 seconds

I also tried what I thought would be an optimization of going through the cliques and filtering them if they are a subset of another clique before aggregate the attestations. That didn't seem to improve the situation, so I reverted that for now.

EDIT: This is no longer an issue

@GeemoCandama GeemoCandama force-pushed the bron_kerbosch_attestation_aggregation branch from b0da5a1 to af26f34 Compare September 28, 2023 15:31
@GeemoCandama
Copy link
Contributor Author

attestation_packing
This is from the block packing dashboard. I collect the parallel iterator because the validty_fn is not Sync. I think some of this could be worked out. I tried to return an iterator from the function but the validity_fn closure is captured and cant be returned.

@GeemoCandama GeemoCandama force-pushed the bron_kerbosch_attestation_aggregation branch from b9c8fae to d2bd6af Compare October 18, 2023 05:16
@GeemoCandama
Copy link
Contributor Author

Before:
Screenshot from 2023-10-19 01-38-38
After:
Screenshot from 2023-10-19 01-37-54
It's a slight improvement on average. But After has slightly higher highs.

@michaelsproul michaelsproul changed the title [WIP] Bron-Kerbosch attestation aggregation Bron-Kerbosch attestation aggregation Nov 3, 2023
@michaelsproul michaelsproul marked this pull request as ready for review November 3, 2023 02:19
@michaelsproul michaelsproul added ready-for-review The code is ready for review v4.6.0 ETA Q1 2024 under-review A reviewer has only partially completed a review. and removed work-in-progress PR is a work-in-progress ready-for-review The code is ready for review labels Nov 3, 2023
@michaelsproul
Copy link
Member

michaelsproul commented Nov 3, 2023

I was testing the performance with --subscribe-all-subnets --import-all-attestations and noticed that it starts to deteriorate after some time. I think it's because we're iterating and processing ~900k unaggregated attestations every slot once the pool fills up.

attestation_packing

This is even with max-aggregates-per-data set to 1!

I realised that the validity_filter is somewhat unnecessary to run for every unaggregated attestation, as the filter we actually use only depends on the checkpoint & the attestation data. I made a hacky patch for this here, which seems to have improved the performance. I'll clean that up tomorrow or early next week (or hand it over to you if you're keen). I think a better solution would be to make the validity filter take a pair of Epoch (or CheckpointKey) and &AttestationData.

I'll post an updated graph once it's had a chance to warm up for a bit. The time seems to be more like the 30ms we see for regular nodes 👌

@michaelsproul
Copy link
Member

Looks a lot better 🎉

attestation_packing_fixed

@michaelsproul
Copy link
Member

I think this is ready to go. I've been running it on one of our Holesky beacon nodes for the last few days without issue. If anything it's a little bit faster than unstable.

Here's the median attestation packing time for this PR (green) vs unstable (purple):

attestation_packing_time_median

Here's the 95th percentile:

attestation_packing_time_95pc

Here's total block production time (median over 12h):

total_time_median

95th percentile over 12h:

total_time_95pc

Investigating that spike with a shorter interval shows it's a spike to 2s+

total_time_1h_95pc

This doesn't really correspond to a spike in attestation packing time, so I suspect it's another issue (lord knows we have state issues while packing blocks). The version of unstable running on the other BNs is also running without the changes from #4794.

I'll let @paulhauner do a once-over of the Rayon stuff (because we've been bitten before), and then we can merge.

@michaelsproul
Copy link
Member

Logs from the spike also seem to show that it was actually only a spike to 1.2s, but even with histogram buckets I don't quite see how that makes any sense

Nov 08 06:08:01.222 DEBG Processed HTTP API request method: GET, path: /eth/v2/validator/blocks/293440, status: 200 OK, elapsed: 1.192266986s

@paulhauner paulhauner self-requested a review November 14, 2023 00:32
Copy link
Member

@paulhauner paulhauner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heyo, looking very impressive!

I had a look at just the Rayon code and I think there's a potential for a deadlock in there (see my comment).

.map(|(data, aggregates)| {
let aggregates: Vec<&CompactIndexedAttestation<T>> = aggregates
.iter()
.filter(|_| validity_filter(checkpoint_key, data))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're at risk of the deadlock described in this HackMD (see also #1562).

The validity_filter function can try to access the fork choice read lock:

let fork_choice_lock = self.canonical_head.fork_choice_read_lock();

That means this function would behave the same as the rest_api::validator::publish_attestations function I describe in the HackMD.

In terms of a workaround, I'm not fully across this PR but I feel like the parallel functionality is most important for the second map fn (the one which calls bron_kerbosch). So perhaps the filter and first map could happen in a good ol' fashioned serial iter?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point. I'm trying to understand how we can deadlock if we are only reading and never holding a write lock while spawning a Rayon task. I think we are safe because we only have reads (item 2 from the HackMD) and not writes (item 1).

Reading the Rayon docs a bit more, I think how the deadlock happens is this:

  1. Thread 1 obtains the write lock and then spawns a Rayon job on the Rayon thread pool and blocks waiting for it to complete.
  2. Concurrently, another thread spawns a bunch of jobs on the Rayon thread pool which try to grab the read lock.
  3. Rayon's pool "schedules" (="steals work such that") all of the reading jobs begin executing but not the job spawned by thread 1 which is required to release the lock. Unlike async tasks in Tokio-land, there is no implicit yielding from Rayon jobs once they start executing.
  4. Therefore we get a deadlock: all the threads in the pool are blocked waiting for a job that can only execute in the pool, which will never execute.

Without holding an exclusive lock/mutex while spawning I think we are safe. If all the jobs in the pool are just grabbing read locks and there is no other thread holding a write lock while it waits for a Rayon job to complete, then we are OK. Even if we have other (non-Rayon) threads grabbing the write lock, this is OK too, as they will be run concurrently by the OS, and the write lock will eventually be released, allowing the Rayon threads that are reading to make progress.

I went looking for places where we are already locking an RwLock/Mutex inside a Rayon job and actually the current version of the op pool uses rayon::join on two lazy iterators that evaluate the same validity_filter (and therefore obtain a read lock on fork choice from within the Rayon pool):

let (prev_cover, curr_cover) = rayon::join(
move || {
let _timer = metrics::start_timer(&metrics::ATTESTATION_PREV_EPOCH_PACKING_TIME);
// If we're in the genesis epoch, just use the current epoch attestations.
if prev_epoch_key == curr_epoch_key {
vec![]
} else {
maximum_cover(prev_epoch_att, prev_epoch_limit, "prev_epoch_attestations")
}
},
move || {
let _timer = metrics::start_timer(&metrics::ATTESTATION_CURR_EPOCH_PACKING_TIME);
maximum_cover(
curr_epoch_att,
T::MaxAttestations::to_usize(),
"curr_epoch_attestations",
)
},
);

i.e. we are already doing this and it is not deadlocking

Additionally in Milhouse, we are recursively obtaining a write lock inside Rayon jobs here:

https://github.com/sigp/milhouse/blob/6c82029bcbc656a3cd423d403d7551974818d45d/src/tree.rs#L430-L433

Mac and I played around with that code trying to use upgradable locks and quickly hit deadlocks, which I think implies the current pattern is safe (see our attempts in sigp/milhouse#25, sigp/milhouse#29).

TL;DR: I think that Rayon is only dangerous is you hold a lock while spawning jobs into the pool.

This was what I previously thought, but I'd never understood why this was the case and thought it had something to do with thread-local storage, context switches and mutexes. I find the explanation above quite satisfactory and hope you do too!

Also, if we were to remove the par_iter we need a collect, as @GeemoCandama showed in 376c408. If you agree with my analysis I reckon we revert that commit 🙏

Thanks for reviewing!

Copy link
Member

@paulhauner paulhauner Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without holding an exclusive lock/mutex while spawning I think we are safe.

I think we hold an exclusive lock whilst spawning here:

fork_choice
.on_block(current_slot, block, block_root, &state)
.map_err(|e| BlockError::BeaconChainError(e.into()))?;

We take a write-lock on fork choice, then we call on_block which might read a state from the DB which might involve rayon-parallelised tree hashing.

So the deadlock would be:

  1. We take the fork choice write lock in on_block but don't yet trigger tree hashing.
  2. We run the par_iter in this PR, which fills all the rayon threads with functions waiting on the fork choice read lock.
  3. on_block tries to start tree hashing but it can't start because all the rayon threads are busy.
  4. We are deadlocked because we need on_block to finish so we can drop the fork choice write-lock.

Copy link
Member

@michaelsproul michaelsproul Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh bugger, I think that could deadlock even with our current code then.

I think we need to choose between:

  1. Never hold a lock while spawning Rayon tasks, or
  2. Never obtain a lock in a Rayon task that could be held over a spawn

I have a feeling that (1) is better in some way, but need to sleep on it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually does on_block load states? I can't see it at a quick glance. I also don't think loading a state does any tree hashing (although it will in tree-states land, so definitely something to keep an eye on).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, here:

let (_, state) = self
.store
.get_advanced_hot_state(
self.justified_checkpoint.root,
max_slot,
justified_block.state_root(),
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I get some time today or tomorrow I'll try writing a test with hiatus that deadlocks on unstable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a lot of mucking around I got block production & tree hashing to deadlock on unstable, sort of. The branch is here: michaelsproul@5e253de

It was really hard to get all the bits of line up, so there are 7 separate hiatus breakpoints. I also had to cheat and just force tree hashing in import_block because I ran out of patience to get the tree hashing and state load in on_block to trigger. The main problem is that the attestations were triggering the justified checkpoint update before I could trigger it with on_block. I think using the mainnet spec would make this a bit easier as we wouldn't have to justify on the epoch boundary (as we do with minimal).

I think this deadlock (on unstable) hasn't been observed in practice for several reasons:

  1. It requires block production and block import to be running concurrently. This doesn't happen particularly often, but isn't that unlikely.
  2. It requires the balances cache in fork choice to miss. This almost never happens in recent versions of Lighthouse.
  3. It requires a machine with ~2 cores, because we only spawn two op pool tasks using rayon::join. If there are more than 2 Rayon threads in the pool then the tree hashing can make progress and drop the fork choice lock, which unblocks the op pool tasks.

In my example you can see the deadlock happen with:

RAYON_NUM_THREADS=2 FORK_NAME=capella cargo test --release --features fork_from_env -- deadlock_block_production --nocapture

If you bump the threadpool size to 3 then the deadlock doesn't happen (due to condition 3).

RAYON_NUM_THREADS=3 FORK_NAME=capella cargo test --release --features fork_from_env -- deadlock_block_production --nocapture

I haven't tried deadlocking Geemo's branch yet, but intuitively we make the deadlock more likely by spawning more threads for the attestation data (condition 3 no longer applies). Still, we are protected by the unlikeliness of condition 1 & condition 2 occurring simultaneously.

I need to think about it more.

@michaelsproul
Copy link
Member

Hey @GeemoCandama I think to play it safe lets remove all the Rayon calls and assess the performance impact. After going deep down the Rayon rabbit hole I don't think there's a completely safe way for us to use Rayon in the op pool at the moment (see #4952).

I think we should revisit parallelism in the op pool at a later date, maybe using a different dedicated thread pool, or Tokio. Hopefully #4925 buys us a little bit of wiggle room by reducing block production times overall.

@michaelsproul michaelsproul added v5.0.0 Q1 2024 and removed v4.6.0 ETA Q1 2024 labels Dec 15, 2023
@GeemoCandama
Copy link
Contributor Author

Screenshot from 2024-01-23 00-32-18

@michaelsproul
Copy link
Member

I did some benchmarking with --subscribe-all-subnets and attestation packing was taking 100-250ms regularly. I think we need the thread pool. I will add a non-Rayon thread pool when I have some time. I'll bump this from the v5.0.0 release, which needs to happen next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimization Something to make Lighthouse run more efficiently. under-review A reviewer has only partially completed a review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants