Paul's Doppelganger Protection Updates #16
Conversation
Yep! On the same page 🙂
Just copying over your entire comment so we can discuss here:

Overview

Heyo 👋 Whilst reviewing sigp#2230 I wanted to make a few changes and I thought it would be easier for everyone to do that myself than to suggest them in comments. However, once I started I got carried away and ended up with a rather large diff. That being said, I think the changes I've added here are a net benefit. I've made this PR against your branch. I've structured the comment here in two sections:
Potential Issues in Existing Implementation

Doppelganger check happens at the start of the slot

In this update loop we're triggering doppelganger protection at the very start of the slot:

lighthouse/validator_client/src/doppelganger_service.rs, lines 47 to 62 in e39feee
Then, we run this code:

lighthouse/validator_client/src/doppelganger_service.rs, lines 151 to 157 in e39feee
So, this means that in the first instant of epoch N we consider the checks for epoch N - 1 complete, before we've had a chance to see any blocks in epoch N that might still include attestations from epoch N - 1.
break instead of continue

    break;

The `break` would cancel all epochs instead of just skipping empty ones.
Enabling doppelganger for all validators if one BN call fails:
lighthouse/validator_client/src/doppelganger_service.rs, lines 143 to 146 in e39feee
    self.validator_store
        .initialized_validators()
        .write()
        .update_all_initialization_epochs(current_epoch, genesis_epoch);
This seems safe, but perhaps overly safe. It might be the case that we've had a few validators running for a while and then added one more validator via the API. If that new validator triggers a failure from the BN then we'd enable doppelganger for all validators.
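For contrast, a strictly per-validator reaction to a failed BN call might look something like this sketch. All names here are hypothetical stand-ins rather than the actual Lighthouse types; only the validator whose liveness query failed is pushed back into the detection period:

```rust
use std::collections::HashMap;

/// Stand-in for a validator public key.
type PublicKeyBytes = String;

/// Hypothetical per-validator flag: is doppelganger detection currently required?
#[derive(Default)]
struct DoppelgangerStatus {
    requires_detection: bool,
}

/// Instead of re-enabling detection for *all* validators when one BN call fails,
/// only the validator whose query failed is put back into the detection period.
fn handle_liveness_failure(
    statuses: &mut HashMap<PublicKeyBytes, DoppelgangerStatus>,
    failed_pubkey: &PublicKeyBytes,
) {
    if let Some(status) = statuses.get_mut(failed_pubkey) {
        status.requires_detection = true;
    }
}

fn main() {
    let mut statuses: HashMap<PublicKeyBytes, DoppelgangerStatus> = HashMap::new();
    statuses.insert("existing_validator".into(), DoppelgangerStatus::default());
    statuses.insert("new_validator".into(), DoppelgangerStatus::default());

    handle_liveness_failure(&mut statuses, &"new_validator".to_string());

    // Only the validator whose BN query failed is affected.
    assert!(statuses["new_validator"].requires_detection);
    assert!(!statuses["existing_validator"].requires_detection);
}
```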
Completing detection even if call to BN fails
Regardless of whether the `liveness_response` is `Ok` or `Err`, we still consider the doppelganger check a success:
lighthouse/validator_client/src/doppelganger_service.rs, lines 151 to 157 in e39feee
    // If we are in the first slot of epoch N, consider checks in epoch N-1 completed.
    if slot == epoch.start_slot(E::slots_per_epoch()) {
        self.validator_store
            .initialized_validators()
            .write()
            .complete_doppelganger_detection_in_epoch(epoch.saturating_sub(Epoch::new(1)));
    }
I didn't follow this all the way through to ensure it's an issue, but it seems suspicious.
Changes Introduced in this PR
Firstly, I merged in `unstable`, which has caused a lot of noise in this diff. You might want to merge it into your branch to quieten down the noise.
Changes include:
- I've also added doppelganger checks to the signing functions as a second line of defense against accidental signing.
- I lifted doppelganger logic out of `InitializedValidators` and collected it in `ValidatorStore`. This meant I could remove `current_epoch, genesis_epoch` from all the initialization calls, which reduced the scope of the change.
- I modified the `Doppelganger` service to make it a bit easier to unit test. I also added a bunch of unit tests to it.
  - This resulted in a decent refactor. Sorry about that, I had a large break during the middle of this project and when I came back to it I needed to dig into it a bit more to ensure I didn't miss anything.
- We now run the doppelganger check when we're 75% through the slot to give us a chance to see the blocks/attestations in that slot.
- I changed it so that we only consider epoch `N - 1` to be safe once we're in the last slot of epoch `N`. The reasoning here is that once we're 75% through the last slot of epoch `N` then we should have seen all blocks that could include attestations from epoch `N - 1`. This means we miss a few more epochs due to doppelganger, but it should be safer. Presently, if you start in epoch 1 then the first epoch you'll be able to participate in will be epoch 5. We could reduce `DEFAULT_REMAINING_DETECTION_EPOCHS` by 1 to make this epochs 1-4 if we wanted. (A rough sketch of this timing rule follows this list.)
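To make the timing rule above concrete, here's a minimal standalone sketch (not the actual Lighthouse code; the constants and helper names are assumptions): the check fires 75% of the way into a slot, and epoch `N - 1` is only considered safe during the last slot of epoch `N`.

```rust
use std::time::Duration;

// Illustrative constants; mainnet values are assumed here.
const SLOTS_PER_EPOCH: u64 = 32;
const SECONDS_PER_SLOT: u64 = 12;

/// Offset into a slot at which the doppelganger check fires (75% through).
fn check_offset_into_slot() -> Duration {
    Duration::from_secs(SECONDS_PER_SLOT) * 3 / 4
}

/// Returns `Some(N - 1)` if `slot` is the last slot of epoch `N`, i.e. the point
/// at which epoch `N - 1` may be considered doppelganger-safe. Otherwise `None`.
fn epoch_considered_safe(slot: u64) -> Option<u64> {
    let epoch = slot / SLOTS_PER_EPOCH;
    let last_slot_of_epoch = (epoch + 1) * SLOTS_PER_EPOCH - 1;
    if slot == last_slot_of_epoch && epoch > 0 {
        Some(epoch - 1)
    } else {
        None
    }
}

fn main() {
    assert_eq!(check_offset_into_slot(), Duration::from_secs(9));
    assert_eq!(epoch_considered_safe(63), Some(0)); // last slot of epoch 1 clears epoch 0
    assert_eq!(epoch_considered_safe(64), None); // first slot of epoch 2: too early
}
```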
Additional Info
There are still a few open questions/issues remaining. Perhaps we just merge this PR and then copy them over to sigp#2230 and deal with them there.
- The doppelganger tests you wrote seem to be failing. I had to deal with some merge conflicts; I might have broken something, sorry.
- I'm not sure we have tests that demonstrate the "everything went well" flow of starting post-genesis, being blocked by doppelganger, not seeing any doppels and then starting to sign. I think it would be cool to have tests like this; the idea of validators failing to start is the sort of thing that could keep us up at night.
- I get the feeling we should disable this by default initially. This is for three reasons: (1) we've seen some pushback in the [Merged by Bors] - Doppelganger detection sigp/lighthouse#2230 comments, (2) Infura isn't going to support this until we can get the liveness endpoint into Teku and (3) it might be nice to get a bit of real-world testing before we push it on everyone by default. I'm quite scared of this change because it adds a point of failure in front of performing duties.
- Should we reduce `DEFAULT_REMAINING_DETECTION_EPOCHS` to `1`?
Leaving comments here, but I plan to also make the updates (have started a couple). @paulhauner let me know if you disagree with anything.
Here are some responses to your initial set of comments as well:
Doppelganger check happens at the start of the slot
My thinking was that we should have already seen attestations on gossip at this point, but yea, we might as well also check for block inclusion.
break instead of continue

👍
Enabling doppelganger for all validators if one BN call fails:
Yes I do think this is overly safe and would rather try to keep all aspects of doppelganger detection on a per-validator basis. Would be an unfortunate scenario for some whales.
I lifted doppelganger logic out of `InitializedValidators` and collected it in `ValidatorStore`. This meant I could remove `current_epoch`, `genesis_epoch` from all the initialization calls which reduced the scope of the change.
Nice! I really didn't like passing those around but had some difficulty structuring things well
I modified the Doppelganger service to make it a bit easier to unit test. I also added a bunch of unit tests to it.
This is awesome, really struggled with testing easily
I changed it so that we only consider epoch N - 1 to be safe once we're in the last slot of epoch N
Yes I was making the assumption that attestations will be received in a timely manner, even though they could come up to 32 slots late. This is probably the way to go.
The doppelganger tests you wrote seem to be failing.
Updated this, I think it broke because the lighthouse exit code changed when we added the `ShutdownReason` in `unstable`.
I'm not sure we have tests that demonstrate the "everything went well" flow of starting post-genesis, being blocked by doppelganger, not seeing any doppels and then starting to sign.
Agree, since we need an API endpoint that includes doppelganger detection status for the sake of the UI, might add that to this PR and see if I can use it to accomplish this.
I get the feeling we should disable this by default initially.
I agree, I think regardless of how vocal we are about this one it'll take people by surprise
Should we reduce DEFAULT_REMAINING_DETECTION_EPOCHS to 1
Definitely down
    S: FnMut(),
{
    let request_epoch = request_slot.epoch(E::slots_per_epoch());
    let previous_epoch = request_epoch.saturating_sub(1_u64);
I think we should just pass `request_epoch` into `get_liveness` and calculate `previous_epoch` within `beacon_node_liveness`.
Good call. Having two ordered `Epoch` params is risky.
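A rough sketch of the suggested shape, with stand-in types; `Epoch`, `LivenessResponse` and these function bodies are illustrative assumptions rather than the real Lighthouse signatures. The point is that `beacon_node_liveness` derives `previous_epoch` itself, so callers never juggle two ordered epoch parameters:

```rust
/// Stand-in for Lighthouse's `Epoch` type.
type Epoch = u64;

/// Stand-in for the beacon node's liveness response.
struct LivenessResponse {
    epoch: Epoch,
    // ...indices and liveness flags would live here.
}

/// Hypothetical query for a single epoch; the caller passes `request_epoch` only.
fn get_liveness(request_epoch: Epoch) -> LivenessResponse {
    LivenessResponse { epoch: request_epoch }
}

/// The wrapper derives `previous_epoch` itself, so callers can't pass two
/// epochs in the wrong order.
fn beacon_node_liveness(request_epoch: Epoch) -> (LivenessResponse, LivenessResponse) {
    let previous_epoch = request_epoch.saturating_sub(1);
    (get_liveness(previous_epoch), get_liveness(request_epoch))
}

fn main() {
    let (prev, current) = beacon_node_liveness(5);
    assert_eq!(prev.epoch, 4);
    assert_eq!(current.epoch, 5);
}
```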
    doppelganger_state.remaining_epochs =
        doppelganger_state.remaining_epochs.saturating_sub(1);

    // Since we just satisfied the `previous_epoch`, the next epoch to satisfy should be
    // the one following that.
    doppelganger_state.next_check_epoch = previous_epoch.saturating_add(1_u64);
Could make this logic a method in DoppelgangerState
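Something along these lines, perhaps (field names follow the quoted snippet; the method name and surrounding scaffolding are assumptions):

```rust
/// Sketch of the per-validator doppelganger state (field names follow the
/// snippet above; everything else here is illustrative).
struct DoppelgangerState {
    next_check_epoch: u64,
    remaining_epochs: u64,
}

impl DoppelgangerState {
    /// Record that `previous_epoch` has been satisfied: consume one remaining
    /// epoch and schedule the next check for the epoch that follows it.
    fn complete_detection_in_epoch(&mut self, previous_epoch: u64) {
        self.remaining_epochs = self.remaining_epochs.saturating_sub(1);
        self.next_check_epoch = previous_epoch.saturating_add(1);
    }
}

fn main() {
    let mut state = DoppelgangerState { next_check_epoch: 10, remaining_epochs: 2 };
    state.complete_detection_in_epoch(10);
    assert_eq!(state.remaining_epochs, 1);
    assert_eq!(state.next_check_epoch, 11);
}
```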
    // Collect *all* pubkeys for resolving indices, even those undergoing doppelganger protection.
    //
    // Singe doppelganger protection queries rely on validator indices it is important to ensure we
Suggested change:

    - // Singe doppelganger protection queries rely on validator indices it is important to ensure we
    + // Since doppelganger protection queries rely on validator indices it is important to ensure we
@@ -137,8 +139,10 @@ impl<T: SlotClock + 'static, E: EthSpec> DutiesService<T, E> {
    pub fn block_proposers(&self, slot: Slot) -> HashSet<PublicKeyBytes> {
        let epoch = slot.epoch(E::slots_per_epoch());

        // this is the set of local pubkeys that are not currently in a doppelganger detection period
        let signing_pubkeys = self.validator_store.signing_pubkeys_hashset();
        // Only collect validators that are not presently disabled for doppelganger protection.
I think it's a little confusing to figure out what "not disabled for doppelganger protection" means
Suggested change:

    - // Only collect validators that are not presently disabled for doppelganger protection.
    + // Only collect validators that are considered safe in terms of doppelganger protection.
@@ -160,8 +164,10 @@ impl<T: SlotClock + 'static, E: EthSpec> DutiesService<T, E> {
    pub fn attesters(&self, slot: Slot) -> Vec<DutyAndProof> {
        let epoch = slot.epoch(E::slots_per_epoch());

        // this is the set of local pubkeys that are not currently in a doppelganger detection period
        let signing_pubkeys = self.validator_store.signing_pubkeys_hashset();
        // Only collect validators that are not presently disabled for doppelganger protection.
Same
    // A weird side-effect is that the BN will keep getting liveness queries that will be
    // ignored by the VC. Since the VC *should* shutdown anyway, this seems fine.
    if violators_exist {
        doppelganger_state.remaining_epochs = u64::max_value();
According to my IDE, `u64::MAX` should be used instead of `u64::max_value()`.
    // Abort the routine if inconsistency is detected.
    .ok_or_else(|| format!("inconsistent states for validator pubkey {}", pubkey))?;

    // If a single doppelganger is detected, enable doppelganger checks on all
I think if we're able to do this on a validator by validator basis it'd be better
Since we're going to shut down the VC once we detect any doppelganger, aren't we effectively locked into the "all or none" behaviour?
The reason I'm jamming all of them here is so it's more firmly "all or none" not a "some until the VC finishes whatever it's doing".
To be clear, I'm adding this "jamming" logic in anticipation of things like sigp#2316 that might try to stop the VC from shutting down.
My thinking here was related to my thinking below about maybe trying to recover if we can't shut down... but I am now thinking that is too risky. I am understanding the reasoning for this logic a bit more; I hadn't considered the impact of sigp#2316.
    // validators forever (technically only 2**64 epochs).
    //
    // This has the effect of stopping validator activity even if the validator client
    // fails to shut down.
Rather than setting this to `u64::MAX`, could we just increment `next_check_epoch` and reset `remaining_epochs` to `DEFAULT_REMAINING_EPOCHS`? I can't imagine hitting this weird scenario where doppelganger fails, doesn't shut down and the VC runs for 2**64 epochs... but it makes more sense to me that you'd just keep pushing back doppelganger detection if for some reason we detected a doppelganger but didn't shut down.
I can't imagine hitting this weird scenario where doppelganger fails, doesn't shut down and the VC runs for 2**64 epochs... but it makes more sense to me that you'd just keep pushing back doppelganger detection if for some reason we detected a doppelganger but didn't shut down.
By my calculations, 2**64 epochs is 224 trillion years so it's pretty safe. We can't really do any better than this either since `next_check_epoch` is ultimately a `u64`.
If we went with the push-back method, it seems to me like that has a possibility of recovering. E.g., someone turns off the remote doppelganger and then we'll start working again. Is that what you were suggesting?
I was thinking if we can't shut down maybe we should try to recover... but I guess it's still too much of a risk that the doppelganger just missed some attestations.
By my calculations, 2**64 epochs is 224 trillion years so it's pretty safe
About how long I plan to be staking for ^ 🤣
I was thinking if we can't shut down maybe we should try to recover... but I guess it's still too much of a risk that the doppelganger just missed some attestations.
Yeah, I'm tempted to leave recovery alone. I'm not sure how to specify it in a way that is safe. I think it's nice to maintain the promise that "if a doppelganger is detected the VC will stop".
Thanks for the comments! I've added a 👍 to all the changes I agree with. I recall you saying you're happy to implement them, so go ahead! There were a couple I didn't quite understand, so I left some comments there :)
- calculate next epoch in `beacon_node_liveness` method
…an/lighthouse into paul-sean-dopple
validator_client/src/lib.rs
            .clone(),
    );

    validator_store.attach_doppelganger_service(service.clone())?;
Found an issue where we are still proposing during doppelganger detection. Seems like it's because we are cloning the validator store before creating the doppelganger service, so the block service and attestation service never get a reference to the doppelganger service.
Ooft! Great find 😬
I'll knock up a patch for this. I think I'll switch the other services to `Arc<ValidatorStore>` and remove `Clone` for `ValidatorStore`. That type of pattern should help prevent this type of bug from reoccurring.
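Roughly the shape of that pattern, with stub types rather than the real Lighthouse code: every service clones the `Arc`, not the store itself, so a doppelganger service attached later is visible to all of them. (The interior `RwLock` here is only to make the sketch self-contained; the actual wiring in the PR differs, as discussed below.)

```rust
use std::sync::{Arc, RwLock};

/// Stand-in for the doppelganger service.
struct DoppelgangerService;

/// Stand-in for `ValidatorStore`. Note: no `#[derive(Clone)]`, so independent
/// copies can't be created accidentally; everyone shares the same `Arc`.
struct ValidatorStore {
    doppelganger_service: RwLock<Option<Arc<DoppelgangerService>>>,
}

impl ValidatorStore {
    fn attach_doppelganger_service(&self, service: Arc<DoppelgangerService>) {
        *self.doppelganger_service.write().unwrap() = Some(service);
    }

    fn doppelganger_protection_active(&self) -> bool {
        self.doppelganger_service.read().unwrap().is_some()
    }
}

fn main() {
    let store = Arc::new(ValidatorStore { doppelganger_service: RwLock::new(None) });

    // The block/attestation services hold clones of the *Arc*, not of the store.
    let block_service_store = Arc::clone(&store);

    // Attaching the doppelganger service later is visible to every holder.
    store.attach_doppelganger_service(Arc::new(DoppelgangerService));
    assert!(block_service_store.doppelganger_protection_active());
}
```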
I'm happy to make the patch too if you haven't started, just let me know 🙂
The fix over in a35bf31 looks good, thanks for that!
I'm wondering why the `Option` for `DoppelgangerService` was removed from `ValidatorStore` and replaced with an "enabled" `bool`? It seems like it would be safer to retain the `Option` since it forces us to check for "enabled" at the type level. Perhaps I'm missing something 🙂
I removed the `Option` because adding an `Arc` around `ValidatorStore` makes it so we can't attach the `DoppelgangerService` to it after initialization without also adding a `RwLock`. So it was sort of an ergonomics/complexity reduction thing. Let me know if you want me to revert it and add the `RwLock` instead.
I had a go at it and was able to add the `Option` without a `RwLock` here: paulhauner@1506166. I also removed `Clone` from `DoppelgangerService` to avoid a similar issue to the one I created with `ValidatorStore`.

I think it's worth having the `Option` since it effectively forces us to check "enabled".
Let me know what you think.
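To illustrate the type-level argument with stand-in types (not the PR's actual code): an `Option` makes the compiler force callers to handle the "doppelganger protection disabled" case, whereas a separate `bool` flag can be silently forgotten.

```rust
/// Stand-in for the doppelganger service.
struct DoppelgangerService;

impl DoppelgangerService {
    /// Illustrative check: is this validator currently considered safe to sign?
    fn validator_is_safe(&self) -> bool {
        false // during the detection period, assume "not yet safe"
    }
}

/// Stand-in for `ValidatorStore` holding an optional service.
struct ValidatorStore {
    doppelganger_service: Option<DoppelgangerService>,
}

impl ValidatorStore {
    /// Callers *must* consider the `None` case: if no service is attached,
    /// doppelganger protection is disabled and signing is allowed.
    fn doppelganger_protection_allows_signing(&self) -> bool {
        match &self.doppelganger_service {
            Some(service) => service.validator_is_safe(),
            None => true,
        }
    }
}

fn main() {
    let disabled = ValidatorStore { doppelganger_service: None };
    let enabled = ValidatorStore { doppelganger_service: Some(DoppelgangerService) };
    assert!(disabled.doppelganger_protection_allows_signing());
    assert!(!enabled.doppelganger_protection_allows_signing());
}
```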
Nice! I do like that better!
…ul-sean-dopple (Conflicts: Cargo.lock)
- Remove `Option` from `validator_store.doppelganger_detection`
- Rename doppelganger detection to doppelganger protection
- Update logging during doppelganger protection period
…an/lighthouse into paul-sean-dopple (Conflicts: validator_client/src/http_api/mod.rs)
Perhaps we should merge this into sigp#2230 and continue there? What are your thoughts?

Good with me!