
Paul's Doppelganger Protection Updates #16

Merged
merged 53 commits into doppleganger-detection from paul-sean-dopple on Jul 13, 2021

Conversation

realbigsean
Owner

See: #15

@realbigsean changed the title from "Paul sean dopple" to "Paul's Doppelganger Protection Updates" on Jun 17, 2021
@paulhauner

Sean, my understanding is that the ball is in your court to review this and merge it into your branch. Then, we can move forward from there and answer some outstanding questions from #15 over in sigp#2230.

I'm not trying to rush, just making sure that we're on the same page 🙂

@realbigsean
Owner Author

Yep! On the same page 🙂

@realbigsean
Owner Author

Just copying over your entire comment so we can discuss here:

Overview

Heyo 👋 Whilst reviewing sigp#2230 I wanted to make a few changes and I thought it would be easier for everyone to do that myself than to suggest in comments. However, once I started I got carried away and ended up with a rather large diff. That being said, I think the changes I've added here are a net benefit.

I've made this PR against your realbigsean/lighthouse fork. My intention is that you could review this PR (comment and changes) and then merge it into your PR. Then, we can finish off in sigp#2230 and merge that.

I've structured the comment here in two sections:

  • "Potential Issues...": these are things in your original PR which I think might need to change. I've detailed them so that you can correct me if I'm wrong or gain feedback if I'm right. You don't need to address them, I've already got to them in my changes 🙂
  • "Changes Introduced...": a summary of the changes I made in this PR. It should help with your review and to see if I've made mistakes.

Potential Issues in Existing Implementation

Doppelganger check happens at the start of the slot

In this update loop we're triggering doppelganger protection at the very start of the slot:

async move {
    loop {
        if let Some(duration) = doppelganger_service.slot_clock.duration_to_next_slot() {
            sleep(duration).await;
        } else {
            // Just sleep for one slot if we are unable to read the system clock, this gives
            // us an opportunity for the clock to eventually come good.
            sleep(doppelganger_service.slot_clock.slot_duration()).await;
            continue;
        }

        if let Err(e) = doppelganger_service.detect_doppelgangers().await {
            error!(log, "Error during doppelganger detection"; "error" => ?e);
        }
    }
}

Then, we run this code:

// If we are in the first slot of epoch N, consider checks in epoch N-1 completed.
if slot == epoch.start_slot(E::slots_per_epoch()) {
    self.validator_store
        .initialized_validators()
        .write()
        .complete_doppelganger_detection_in_epoch(epoch.saturating_sub(Epoch::new(1)));
}

So, this means that in the first instant of epoch N, we consider N - 1 to be safe. However, attestations from the last slot of epoch N - 1 are only first allowed to appear in blocks in epoch N. Since we do the doppelganger check so early in the first slot of epoch N we don't allow any time for a block to arrive with N - 1 attestations.
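
For illustration, here is a minimal standalone sketch of the slot arithmetic, using plain u64s rather than Lighthouse's Slot/Epoch types and the spec's MIN_ATTESTATION_INCLUSION_DELAY of one slot: an attestation from the last slot of epoch N - 1 can first appear in a block at the first slot of epoch N, i.e. exactly the slot where the check currently runs.

// Standalone sketch, not Lighthouse code.
const SLOTS_PER_EPOCH: u64 = 32;
const MIN_ATTESTATION_INCLUSION_DELAY: u64 = 1;

fn main() {
    let epoch_n = 10u64;
    // Last slot of epoch N - 1.
    let last_slot_prev_epoch = epoch_n * SLOTS_PER_EPOCH - 1;
    // Earliest slot whose block may carry an attestation from that slot.
    let earliest_inclusion_slot = last_slot_prev_epoch + MIN_ATTESTATION_INCLUSION_DELAY;
    // That earliest slot is exactly the first slot of epoch N, which is when the
    // existing implementation already declares epoch N - 1 complete.
    assert_eq!(earliest_inclusion_slot, epoch_n * SLOTS_PER_EPOCH);
    println!("earliest inclusion slot: {}", earliest_inclusion_slot);
}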

break instead of continue

Here there's a break which I think should be a continue:

The break would cancel all epochs instead of just skipping empty ones.
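
A hypothetical, standalone illustration of the control-flow difference (the names here are made up, not the actual Lighthouse function):

// `break` abandons every remaining epoch, whereas `continue` only skips the epoch
// that has nothing to check.
fn check_epochs(epochs_and_indices: &[(u64, Vec<u64>)]) -> Vec<u64> {
    let mut checked = vec![];
    for (epoch, indices) in epochs_and_indices {
        if indices.is_empty() {
            // A `break` here would silently drop all later epochs from the check.
            continue;
        }
        checked.push(*epoch);
    }
    checked
}

fn main() {
    let input = vec![(1, vec![0]), (2, vec![]), (3, vec![5])];
    // With `continue`, epochs 1 and 3 are still checked; a `break` would lose epoch 3.
    assert_eq!(check_epochs(&input), vec![1, 3]);
}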

Enabling doppelganger for all validators if one BN call fails:

self.validator_store
    .initialized_validators()
    .write()
    .update_all_initialization_epochs(current_epoch, genesis_epoch);

This seems safe, but perhaps overly safe. It might be the case that we've had a few validators running for a while and then added one more validator via the API. If that new validator triggers a failure from the BN then we'd enable doppelganger for all validators.

Completing detection even if call to BN fails

Regardless of whether the liveness_response is Ok or Err, we still consider the doppelganger check a success:

// If we are in the first slot of epoch N, consider checks in epoch N-1 completed.
if slot == epoch.start_slot(E::slots_per_epoch()) {
    self.validator_store
        .initialized_validators()
        .write()
        .complete_doppelganger_detection_in_epoch(epoch.saturating_sub(Epoch::new(1)));
}

I didn't follow this all the way through to ensure it's an issue, but it seems suspicious.
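
For what it's worth, one way to avoid that would be to gate completion on a successful response. This is only an illustrative sketch with made-up names (Detector, on_slot), not the actual implementation:

// Only mark epoch N - 1 as complete when the liveness call succeeded and came back
// clean, rather than unconditionally at the first slot of epoch N.
struct Detector {
    completed_epochs: Vec<u64>,
}

impl Detector {
    fn on_slot(&mut self, slot: u64, slots_per_epoch: u64, liveness: Result<Vec<bool>, String>) {
        let epoch = slot / slots_per_epoch;
        let is_first_slot_of_epoch = slot % slots_per_epoch == 0;
        match liveness {
            // A successful, doppelganger-free response lets us sign off on N - 1.
            Ok(flags) if is_first_slot_of_epoch && flags.iter().all(|live| !live) => {
                self.completed_epochs.push(epoch.saturating_sub(1));
            }
            // A doppelganger was seen, or it isn't the right slot: don't complete.
            Ok(_) => {}
            // The BN call failed: don't treat the epoch as checked.
            Err(_) => {}
        }
    }
}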

Changes Introduced in this PR

Firstly, I merged in unstable which has caused a lot of noise in this diff. You might want to merge it into your branch to quieten down the noise.

Changes include:

  • I've also added doppelganger checks to the signing functions as a second line of defense against accidental signing.
  • I lifted doppelganger logic out of InitializedValidators and collected it in ValidatorStore. This meant I could remove current_epoch, genesis_epoch from all the initialization calls which reduced the scope of the change.
  • I modified the Doppelganger service to make it a bit easier to unit test. I also added a bunch of unit tests to it.
    • This resulted in a decent refactor. Sorry about that, I had a large break during the middle of this project and when I came back to it I needed to dig into it a bit more to ensure I didn't miss anything.
  • We now run the doppelganger check when we're 75% through the slot to give us a chance to see the blocks/attestations in that slot (see the delay sketch after this list).
  • I changed it so that we only consider epoch N - 1 to be safe once we're in the last slot of epoch N. The reasoning here is that once we're 75% through the last slot of epoch N then we should have seen all blocks that could include attestations from epoch N - 1. This means we miss a few more epochs due to doppelganger, but it should be safer. Presently, if you start in epoch 1 then the first epoch you'll be able to participate in will be epoch 5. We could reduce DEFAULT_REMAINING_DETECTION_EPOCHS by 1 to make this epochs 1-4 if we wanted.
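
A minimal sketch of the 75%-of-slot delay mentioned above, assuming SlotClock-style inputs (slot_duration and duration_to_next_slot); this is illustrative arithmetic only, not the service's actual code:

use std::time::Duration;

// Wait until ~75% of the way through the current slot before running the liveness
// check, so blocks/attestations from that slot have had a chance to arrive.
fn delay_until_three_quarters(duration_to_next_slot: Duration, slot_duration: Duration) -> Duration {
    // Time already elapsed in the current slot.
    let elapsed = slot_duration.saturating_sub(duration_to_next_slot);
    // Target point: 75% of the slot duration.
    let target = slot_duration.mul_f64(0.75);
    target.saturating_sub(elapsed)
}

fn main() {
    // 12s slots with 8s until the next slot => 4s elapsed, so wait 5s to reach 9s.
    let wait = delay_until_three_quarters(Duration::from_secs(8), Duration::from_secs(12));
    assert_eq!(wait, Duration::from_secs(5));
}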

Additional Info

There are still a few open questions/issues remaining. Perhaps we just merge this PR and then copy them over to sigp#2230 and deal with them there.

  • The doppelganger tests you wrote seem to be failing. I had to deal with some merge conflicts, so I might have broken something, sorry.
  • I'm not sure we have tests that demonstrate the "everything went well" flow of starting post-genesis, being blocked by doppelganger, not seeing any doppels and then starting to sign. I think it would be cool to have tests like this; the idea of validators failing to start is the sort of thing that could keep us up at night.
  • I get the feeling we should disable this by default initially. This is for three reasons: (1) we've seen some pushback in the sigp/lighthouse#2230 comments, (2) Infura isn't going to support this until we can get the liveness endpoint into Teku, and (3) it might be nice to get a bit of real-world testing before we push it on everyone by default. I'm quite scared of this change because it adds a point of failure in front of performing duties.
  • Should we reduce DEFAULT_REMAINING_DETECTION_EPOCHS to 1?

@realbigsean (Owner Author) left a comment

Leaving comments here, but I plan to also make the updates (have started a couple). @paulhauner let me know if you disagree with anything.

Here are some responses to your initial set of comments as well:

Doppelganger check happens at the start of the slot

My thinking was that we should have already seen attestations on gossip at this point, but yea, we might as well also check for block inclusion.

break instead of continue

👍

Enabling doppelganger for all validators if one BN call fails:

Yes, I do think this is overly safe and would rather try to keep all aspects of doppelganger detection on a per-validator basis. It would be an unfortunate scenario for some whales.
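
Roughly, the per-validator alternative could look like the following sketch; the types (ValidatorId, a per-validator remaining_epochs map) are illustrative, not the actual data structures:

use std::collections::HashMap;

type ValidatorId = u64;

// If a liveness call fails, only the validators covered by that call have their
// detection window extended, instead of re-enabling doppelganger for everyone.
fn extend_on_failure(
    remaining_epochs: &mut HashMap<ValidatorId, u64>,
    failed_batch: &[ValidatorId],
    default_remaining: u64,
) {
    for id in failed_batch {
        remaining_epochs
            .entry(*id)
            .and_modify(|epochs| *epochs = (*epochs).max(default_remaining))
            .or_insert(default_remaining);
    }
}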

I lifted doppelganger logic out of InitializedValidators and collected it in ValidatorStore. This meant I could remove current_epoch, genesis_epoch from all the initialization calls which reduced the scope of the change.

Nice! I really didn't like passing those around but had some difficulty structuring things well.

I modified the Doppelganger service to make it a bit easier to unit test. I also added a bunch of unit tests to it.

This is awesome; I really struggled to test this easily.

I changed it so that we only consider epoch N - 1 to be safe once we're in the last slot of epoch N

Yes, I was making the assumption that attestations would be received in a timely manner, even though they could come up to 32 slots late. This is probably the way to go.

The doppelganger tests you wrote seem to be failing.

Updated this. I think it broke because the lighthouse exit code changed when we added the ShutdownReason in unstable.

I'm not sure we have tests that demonstrate the "everything went well" flow of starting post-genesis, being blocked by doppelganger, not seeing any doppels and then starting to sign.

Agreed. Since we need an API endpoint that includes doppelganger detection status for the sake of the UI, I might add that to this PR and see if I can use it to accomplish this.

I get the feeling we should disable this by default initially.

I agree. I think regardless of how vocal we are about this one, it'll take people by surprise.

Should we reduce DEFAULT_REMAINING_DETECTION_EPOCHS to 1?

Definitely down

S: FnMut(),
{
let request_epoch = request_slot.epoch(E::slots_per_epoch());
let previous_epoch = request_epoch.saturating_sub(1_u64);
realbigsean (Owner Author)

I think we should just pass request_epoch into get_liveness and calculate previous_epoch within beacon_node_liveness

@paulhauner paulhauner Jun 25, 2021

Good call. Having two ordered Epoch params is risky.
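
For reference, the suggested shape would be roughly this (illustrative signatures only, not the real Lighthouse APIs):

type Epoch = u64;
type PublicKey = [u8; 48];

// The caller supplies a single `request_epoch`; `beacon_node_liveness` derives
// `previous_epoch` itself, so there is no pair of ordered Epoch arguments to swap.
fn get_liveness(pubkeys: &[PublicKey], request_epoch: Epoch) -> Vec<(PublicKey, bool)> {
    beacon_node_liveness(pubkeys, request_epoch)
}

fn beacon_node_liveness(pubkeys: &[PublicKey], request_epoch: Epoch) -> Vec<(PublicKey, bool)> {
    // The previous epoch is computed here, in exactly one place.
    let _previous_epoch = request_epoch.saturating_sub(1);
    // ...query the beacon node for both epochs and merge the results...
    pubkeys.iter().map(|pk| (*pk, false)).collect()
}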

Comment on lines 616 to 621
doppelganger_state.remaining_epochs =
doppelganger_state.remaining_epochs.saturating_sub(1);

// Since we just satisfied the `previous_epoch`, the next epoch to satisfy should be
// the one following that.
doppelganger_state.next_check_epoch = previous_epoch.saturating_add(1_u64);
realbigsean (Owner Author)

Could make this logic a method in DoppelgangerState
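
Something along these lines, for example (field and method names are illustrative, not the exact Lighthouse definitions):

type Epoch = u64;

struct DoppelgangerState {
    next_check_epoch: Epoch,
    remaining_epochs: u64,
}

impl DoppelgangerState {
    /// Record that `previous_epoch` was checked without finding a doppelganger.
    fn complete_detection_in_epoch(&mut self, previous_epoch: Epoch) {
        self.remaining_epochs = self.remaining_epochs.saturating_sub(1);
        // The next epoch to satisfy is the one following the epoch just satisfied.
        self.next_check_epoch = previous_epoch.saturating_add(1);
    }
}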


// Collect *all* pubkeys for resolving indices, even those undergoing doppelganger protection.
//
// Singe doppelganger protection queries rely on validator indices it is important to ensure we
realbigsean (Owner Author)

Suggested change
// Singe doppelganger protection queries rely on validator indices it is important to ensure we
// Since doppelganger protection queries rely on validator indices it is important to ensure we

@@ -137,8 +139,10 @@ impl<T: SlotClock + 'static, E: EthSpec> DutiesService<T, E> {
pub fn block_proposers(&self, slot: Slot) -> HashSet<PublicKeyBytes> {
let epoch = slot.epoch(E::slots_per_epoch());

// this is the set of local pubkeys that are not currently in a doppelganger detection period
let signing_pubkeys = self.validator_store.signing_pubkeys_hashset();
// Only collect validators that are not presently disabled for doppelganger protection.
realbigsean (Owner Author)

I think it's a little confusing to figure out what "not disabled for doppelganger protection" means

Suggested change
// Only collect validators that are not presently disabled for doppelganger protection.
// Only collect validators that are considered safe in terms of doppelganger protection.

@@ -160,8 +164,10 @@ impl<T: SlotClock + 'static, E: EthSpec> DutiesService<T, E> {
pub fn attesters(&self, slot: Slot) -> Vec<DutyAndProof> {
let epoch = slot.epoch(E::slots_per_epoch());

// this is the set of local pubkeys that are not currently in a doppelganger detection period
let signing_pubkeys = self.validator_store.signing_pubkeys_hashset();
// Only collect validators that are not presently disabled for doppelganger protection.
realbigsean (Owner Author)

Same

// A weird side-effect is that the BN will keep getting liveness queries that will be
// ignored by the VC. Since the VC *should* shutdown anyway, this seems fine.
if violators_exist {
doppelganger_state.remaining_epochs = u64::max_value();
realbigsean (Owner Author)

According to my IDE, u64::MAX should be used instead of u64::max_value().

// Abort the routine if inconsistency is detected.
.ok_or_else(|| format!("inconsistent states for validator pubkey {}", pubkey))?;

// If a single doppelganger is detected, enable doppelganger checks on all
realbigsean (Owner Author)

I think if we're able to do this on a validator-by-validator basis it'd be better.

@paulhauner paulhauner Jun 25, 2021

Since we're going to shut down the VC once we detect any doppelganger, aren't we effectively locked into the "all or none" behaviour?

The reason I'm jamming all of them here is so it's more firmly "all or none" not a "some until the VC finishes whatever it's doing".

To be clear, I'm adding this "jamming" logic in anticipation of things like sigp#2316 that might try to stop the VC from shutting down.

realbigsean (Owner Author)

My thinking here was related to my thinking below about maybe trying to recover if we can't shut down... but I am now thinking that is too risky. I understand the reasoning for this logic a bit more now; I hadn't considered the impact of sigp#2316.

Comment on lines +599 to +602
// validators forever (technically only 2**64 epochs).
//
// This has the effect of stopping validator activity even if the validator client
// fails to shut down.
realbigsean (Owner Author)

Rather than setting this to u64::MAX could we just increment next_check_epoch and reset remaining_epochs to DEFAULT_REMAINING_EPOCHS? I can't imagine hitting this weird scenario where doppelganger fails, doesn't shut down, and the VC runs for 2**64 epochs... but it makes more sense to me that you'd just keep pushing back doppelganger detection if for some reason we detected a doppelganger but didn't shut down.
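
To make the two options concrete, here is a rough side-by-side sketch (illustrative types only, and the constant's value is just a placeholder for the sketch):

type Epoch = u64;
const DEFAULT_REMAINING_DETECTION_EPOCHS: u64 = 2; // placeholder value for the sketch

struct DoppelgangerState {
    next_check_epoch: Epoch,
    remaining_epochs: u64,
}

impl DoppelgangerState {
    // Current approach: effectively never allow signing again, even if the VC
    // somehow fails to shut down.
    fn jam(&mut self) {
        self.remaining_epochs = u64::MAX;
    }

    // Suggested alternative: restart the detection window, which could eventually
    // recover if the remote doppelganger goes away.
    fn push_back(&mut self, current_epoch: Epoch) {
        self.next_check_epoch = current_epoch.saturating_add(1);
        self.remaining_epochs = DEFAULT_REMAINING_DETECTION_EPOCHS;
    }
}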

@paulhauner

I can't imagine hitting this weird scenario where doppelganger fails, doesn't shut down, and the VC runs for 2**64 epochs... but it makes more sense to me that you'd just keep pushing back doppelganger detection if for some reason we detected a doppelganger but didn't shut down.

By my calculations, 2**64 epochs is 224 trillion years so it's pretty safe. We can't really do any better than this either since next_check_epoch is ultimately a u64.

If we went with the push-back method, it seems to me like that has a possibility of recovering. E.g., someone turns off the remote doppelganger and then we'll start working again. Is that what you were suggesting?

realbigsean (Owner Author)

I was thinking if we can't shut down maybe we should try to recover... but I guess it's still too much of a risk that the doppelganger just missed some attestations.

By my calculations, 2**64 epochs is 224 trillion years so it's pretty safe

About how long I plan to be staking for ^ 🤣

@paulhauner

I was thinking if we can't shut down maybe we should try to recover... but I guess it's still too much of a risk that the doppelganger just missed some attestations.

Yeah, I'm tempted to leave recovery alone. I'm not sure how to specify it in a way that is safe. I think it's nice to maintain the promise that "if a doppelganger is detected the VC will stop".

@paulhauner

Thanks for the comments! I've added a 👍 to all the changes I agree with. I recall you saying you're happy to implement them, so go ahead!

There were a couple I didn't quite understand, so I left some comments there :)

.clone(),
);

validator_store.attach_doppelganger_service(service.clone())?;
realbigsean (Owner Author)

Found an issue where we are still proposing during doppelganger detection. Seems like it's because we are cloning the validator store before creating the doppelganger service, so the block service and attestation service never get a reference to the doppelganger service.
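
A minimal, made-up reconstruction of that bug (the types here are placeholders, not the real ValidatorStore):

// Because the store derives Clone, a service built from a clone taken *before*
// the doppelganger service is attached keeps a copy with no doppelganger reference.
#[derive(Clone)]
struct DoppelgangerServiceHandle;

#[derive(Clone, Default)]
struct ValidatorStore {
    doppelganger: Option<DoppelgangerServiceHandle>,
}

fn main() {
    let mut store = ValidatorStore::default();
    // Block/attestation services take their own clone first...
    let block_service_store = store.clone();
    // ...then the doppelganger service is attached to the original only.
    store.doppelganger = Some(DoppelgangerServiceHandle);
    // The earlier clone never sees it, so doppelganger checks are silently skipped.
    assert!(block_service_store.doppelganger.is_none());
}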

@paulhauner

Ooft! Great find 😬

I'll knock up a patch for this. I think I'll switch the other services to Arc<ValidatorStore> and remove Clone for ValidatorStore. That type of pattern should help prevent this type of bug from reoccurring.

realbigsean (Owner Author)

I'm happy to make the patch too if you haven't started, just let me know 🙂

@paulhauner

The fix over in a35bf31 looks good, thanks for that!

I'm wondering why the Option for DoppelgangerService was removed from ValidatorStore and replaced with an "enabled" bool? It seems like it would be safer to retain the Option since it forces us to check for "enabled" at the type-level. Perhaps I'm missing something 🙂

realbigsean (Owner Author)

I removed the Option because adding an Arc around ValidatorStore makes it so we can't attach the DoppelgangerService to it after initialization without also adding a RwLock. So it was sort of an ergonomics/complexity reduction thing. Let me know if you want me to revert it and add the RwLock instead

@paulhauner

I had a go at it and was able to add the Option without a RwLock here: paulhauner@1506166. I also removed Clone from DoppelgangerService to avoid a similar issue to the one I created with ValidatorStore.

I think it's worth having the Option since it effectively forces us to check "enabled".

Let me know what you think.
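
To illustrate the idea being discussed (names are placeholders, not the commit's actual code): with the DoppelgangerService created before the ValidatorStore and passed in as an Option at construction, no RwLock is needed and the type forces callers to handle the disabled case.

use std::sync::Arc;

struct DoppelgangerService;

impl DoppelgangerService {
    fn validator_should_sign(&self) -> bool {
        // The real logic would consult per-validator doppelganger state.
        false
    }
}

struct ValidatorStore {
    doppelganger_service: Option<Arc<DoppelgangerService>>,
}

impl ValidatorStore {
    fn doppelganger_protection_allows_signing(&self) -> bool {
        // `Option` makes the "is it enabled?" question impossible to forget:
        // None means protection is disabled, so signing is always allowed.
        self.doppelganger_service
            .as_ref()
            .map_or(true, |service| service.validator_should_sign())
    }
}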

realbigsean (Owner Author)

Nice! I do like that better!

@paulhauner

Perhaps we should merge this into sigp#2230 and continue there? What are your thoughts?

@realbigsean
Owner Author

Good with me!

@realbigsean realbigsean merged commit 97d6151 into doppleganger-detection Jul 13, 2021
@realbigsean realbigsean deleted the paul-sean-dopple branch November 21, 2023 16:15
realbigsean pushed a commit that referenced this pull request Apr 5, 2024
* Single-pass epoch processing (sigp#4483, sigp#4573)

Co-authored-by: Michael Sproul <[email protected]>

* Delete unused epoch processing code (sigp#5170)

* Delete unused epoch processing code

* Compare total deltas

* Remove unnecessary apply_pending

* cargo fmt

* Remove newline

* Use epoch cache in block packing (sigp#5223)

* Remove progressive balances mode (sigp#5224)

* inline inactivity_penalty_quotient_for_state

* drop previous_epoch_total_active_balance

* fc lint

* spec compliant process_sync_aggregate (#15)

* spec compliant process_sync_aggregate

* Update consensus/state_processing/src/per_block_processing/altair/sync_committee.rs

Co-authored-by: Michael Sproul <[email protected]>

---------

Co-authored-by: Michael Sproul <[email protected]>

* Delete the participation cache (#16)

* update help

* Fix op_pool tests

* Fix fork choice tests

* Merge remote-tracking branch 'sigp/unstable' into epoch-single-pass

* Simplify exit cache (sigp#5280)

* Fix clippy on exit cache

* Clean up single-pass a bit (sigp#5282)

* Address Mark's review of single-pass (sigp#5386)

* Merge remote-tracking branch 'origin/unstable' into epoch-single-pass

* Address Sean's review comments (sigp#5414)

* Address most of Sean's review comments

* Simplify total balance cache building

* Clean up unused junk

* Merge remote-tracking branch 'origin/unstable' into epoch-single-pass

* More self-review

* Merge remote-tracking branch 'origin/unstable' into epoch-single-pass

* Merge branch 'unstable' into epoch-single-pass

* Fix imports for beta compiler

* Fix tests, probably