Paul's Doppelganger Protection Updates #16
Conversation
Yep! On the same page 🙂
Just copying over your entire comment so we can discuss here:

Overview

Heyo 👋 Whilst reviewing sigp#2230 I wanted to make a few changes and I thought it would be easier for everyone to do that myself than to suggest them in comments. However, once I started I got carried away and ended up with a rather large diff. That being said, I think the changes I've added here are a net benefit. I've made this PR against your branch. I've structured the comment here in two sections:
Potential Issues in Existing Implementation

Doppelganger check happens at the start of the slot

In this update loop we're triggering doppelganger protection at the very start of the slot:

lighthouse/validator_client/src/doppelganger_service.rs, lines 47 to 62 in e39feee
Then, we run this code:

lighthouse/validator_client/src/doppelganger_service.rs, lines 151 to 157 in e39feee
So, this means that in the first instant of epoch N we consider the checks for epoch N - 1 complete, before we've had a chance to see any blocks in epoch N that might still include attestations from epoch N - 1.
break instead of continue

    break;

The `break` would cancel all epochs instead of just skipping empty ones.
Enabling doppelganger for all validators if one BN call fails:
lighthouse/validator_client/src/doppelganger_service.rs, lines 143 to 146 in e39feee
    self.validator_store
        .initialized_validators()
        .write()
        .update_all_initialization_epochs(current_epoch, genesis_epoch);
This seems safe, but perhaps overly safe. It might be the case that we've had a few validators running for a while and then added one more validator via the API. If that new validator triggers a failure from the BN then we'd enable doppelganger for all validators.
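For contrast, a strictly per-validator reaction to a failed BN call might look something like this sketch. All names here are hypothetical stand-ins rather than the actual Lighthouse types; only the validator whose liveness query failed is pushed back into the detection period:

```rust
use std::collections::HashMap;

/// Stand-in for a validator public key.
type PublicKeyBytes = String;

/// Hypothetical per-validator flag: is doppelganger detection currently required?
#[derive(Default)]
struct DoppelgangerStatus {
    requires_detection: bool,
}

/// Instead of re-enabling detection for *all* validators when one BN call fails,
/// only the validator whose query failed is put back into the detection period.
fn handle_liveness_failure(
    statuses: &mut HashMap<PublicKeyBytes, DoppelgangerStatus>,
    failed_pubkey: &PublicKeyBytes,
) {
    if let Some(status) = statuses.get_mut(failed_pubkey) {
        status.requires_detection = true;
    }
}

fn main() {
    let mut statuses: HashMap<PublicKeyBytes, DoppelgangerStatus> = HashMap::new();
    statuses.insert("existing_validator".into(), DoppelgangerStatus::default());
    statuses.insert("new_validator".into(), DoppelgangerStatus::default());

    handle_liveness_failure(&mut statuses, &"new_validator".to_string());

    // Only the validator whose BN query failed is affected.
    assert!(statuses["new_validator"].requires_detection);
    assert!(!statuses["existing_validator"].requires_detection);
}
```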
Completing detection even if call to BN fails
Regardless of whether the `liveness_response` is `Ok` or `Err`, we still consider the doppelganger check a success:
lighthouse/validator_client/src/doppelganger_service.rs, lines 151 to 157 in e39feee
    // If we are in the first slot of epoch N, consider checks in epoch N-1 completed.
    if slot == epoch.start_slot(E::slots_per_epoch()) {
        self.validator_store
            .initialized_validators()
            .write()
            .complete_doppelganger_detection_in_epoch(epoch.saturating_sub(Epoch::new(1)));
    }
I didn't follow this all the way through to ensure it's an issue, but it seems suspicious.
Changes Introduced in this PR
Firstly, I merged in `unstable`, which has caused a lot of noise in this diff. You might want to merge it into your branch to quieten down the noise.
Changes include:
- I've also added doppelganger checks to the signing functions as a second line of defense against accidental signing.
- I lifted doppelganger logic out of `InitializedValidators` and collected it in `ValidatorStore`. This meant I could remove `current_epoch, genesis_epoch` from all the initialization calls, which reduced the scope of the change.
- I modified the `Doppelganger` service to make it a bit easier to unit test. I also added a bunch of unit tests to it.
  - This resulted in a decent refactor. Sorry about that, I had a large break during the middle of this project and when I came back to it I needed to dig into it a bit more to ensure I didn't miss anything.
- We now run the doppelganger check when we're 75% through the slot to give us a chance to see the blocks/attestations in that slot.
- I changed it so that we only consider epoch `N - 1` to be safe once we're in the last slot of epoch `N`. The reasoning here is that once we're 75% through the last slot of epoch `N` then we should have seen all blocks that could include attestations from epoch `N - 1`. This means we miss a few more epochs due to doppelganger, but it should be safer. Presently, if you start in epoch 1 then the first epoch you'll be able to participate in will be epoch 5. We could reduce `DEFAULT_REMAINING_DETECTION_EPOCHS` by 1 to make this epochs 1-4 if we wanted. (A rough sketch of this timing rule follows this list.)
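To make the timing rule above concrete, here's a minimal standalone sketch (not the actual Lighthouse code; the constants and helper names are assumptions): the check fires 75% of the way into a slot, and epoch `N - 1` is only considered safe during the last slot of epoch `N`.

```rust
use std::time::Duration;

// Illustrative constants; mainnet values are assumed here.
const SLOTS_PER_EPOCH: u64 = 32;
const SECONDS_PER_SLOT: u64 = 12;

/// Offset into a slot at which the doppelganger check fires (75% through).
fn check_offset_into_slot() -> Duration {
    Duration::from_secs(SECONDS_PER_SLOT) * 3 / 4
}

/// Returns `Some(N - 1)` if `slot` is the last slot of epoch `N`, i.e. the point
/// at which epoch `N - 1` may be considered doppelganger-safe. Otherwise `None`.
fn epoch_considered_safe(slot: u64) -> Option<u64> {
    let epoch = slot / SLOTS_PER_EPOCH;
    let last_slot_of_epoch = (epoch + 1) * SLOTS_PER_EPOCH - 1;
    if slot == last_slot_of_epoch && epoch > 0 {
        Some(epoch - 1)
    } else {
        None
    }
}

fn main() {
    assert_eq!(check_offset_into_slot(), Duration::from_secs(9));
    assert_eq!(epoch_considered_safe(63), Some(0)); // last slot of epoch 1 clears epoch 0
    assert_eq!(epoch_considered_safe(64), None); // first slot of epoch 2: too early
}
```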
Additional Info
There are still a few open questions/issues remaining. Perhaps we just merge this PR and then copy them over to sigp#2230 and deal with them there.
- The doppelganger tests you wrote seem to be failing. I had to deal with some merge conflicts; I might have broken something, sorry.
- I'm not sure we have tests that demonstrate the "everything went well" flow of starting post-genesis, being blocked by doppelganger, not seeing any doppels and then starting to sign. I think it would be cool to have tests like this; the idea of validators failing to start is the sort of thing that could keep us up at night.
- I get the feeling we should disable this by default initially. This is for three reasons: (1) we've seen some pushback in the [Merged by Bors] - Doppelganger detection sigp/lighthouse#2230 comments, (2) Infura isn't going to support this until we can get the liveness endpoint into Teku and (3) it might be nice to get a bit of real-world testing before we push it on everyone by default. I'm quite scared of this change because it adds a point of failure in front of performing duties.
- Should we reduce `DEFAULT_REMAINING_DETECTION_EPOCHS` to `1`?
Leaving comments here, but I plan to also make the updates (have started a couple). @paulhauner let me know if you disagree with anything.
Here are some responses to your initial set of comments as well:
Doppelganger check happens at the start of the slot
My thinking was that we should have already seen attestations on gossip at this point, but yea, we might as well also check for block inclusion.
break instead of continue

👍
Enabling doppelganger for all validators if one BN call fails:
Yes I do think this is overly safe and would rather try to keep all aspects of doppelganger detection on a per-validator basis. Would be an unfortunate scenario for some whales.
I lifted doppelganger logic out of `InitializedValidators` and collected it in `ValidatorStore`. This meant I could remove `current_epoch`, `genesis_epoch` from all the initialization calls which reduced the scope of the change.
Nice! I really didn't like passing those around but had some difficulty structuring things well
I modified the Doppelganger service to make it a bit easier to unit test. I also added a bunch of unit tests to it.
This is awesome, really struggled with testing easily
I changed it so that we only consider epoch N - 1 to be safe once we're in the last slot of epoch N
Yes I was making the assumption that attestations will be received in a timely manner, even though they could come up to 32 slots late. This is probably the way to go.
The doppelganger tests you wrote seem to be failing.
Updated this, I think it broke because the lighthouse exit code changed when we added the `ShutdownReason` in `unstable`.
I'm not sure we have tests that demonstrate the "everything went well" flow of starting post-genesis, being blocked by doppelganger, not seeing any doppels and then starting to sign.
Agree, since we need an API endpoint that includes doppelganger detection status for the sake of the UI, might add that to this PR and see if I can use it to accomplish this.
I get the feeling we should disable this by default initially.
I agree, I think regardless of how vocal we are about this one it'll take people by surprise
Should we reduce DEFAULT_REMAINING_DETECTION_EPOCHS to 1
Definitely down
    S: FnMut(),
{
    let request_epoch = request_slot.epoch(E::slots_per_epoch());
    let previous_epoch = request_epoch.saturating_sub(1_u64);
I think we should just pass `request_epoch` into `get_liveness` and calculate `previous_epoch` within `beacon_node_liveness`.
Good call. Having two ordered `Epoch` params is risky.
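A rough sketch of the suggested shape, with stand-in types; `Epoch`, `LivenessResponse` and these function bodies are illustrative assumptions rather than the real Lighthouse signatures. The point is that `beacon_node_liveness` derives `previous_epoch` itself, so callers never juggle two ordered epoch parameters:

```rust
/// Stand-in for Lighthouse's `Epoch` type.
type Epoch = u64;

/// Stand-in for the beacon node's liveness response.
struct LivenessResponse {
    epoch: Epoch,
    // ...indices and liveness flags would live here.
}

/// Hypothetical query for a single epoch; the caller passes `request_epoch` only.
fn get_liveness(request_epoch: Epoch) -> LivenessResponse {
    LivenessResponse { epoch: request_epoch }
}

/// The wrapper derives `previous_epoch` itself, so callers can't pass two
/// epochs in the wrong order.
fn beacon_node_liveness(request_epoch: Epoch) -> (LivenessResponse, LivenessResponse) {
    let previous_epoch = request_epoch.saturating_sub(1);
    (get_liveness(previous_epoch), get_liveness(request_epoch))
}

fn main() {
    let (prev, current) = beacon_node_liveness(5);
    assert_eq!(prev.epoch, 4);
    assert_eq!(current.epoch, 5);
}
```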
    doppelganger_state.remaining_epochs =
        doppelganger_state.remaining_epochs.saturating_sub(1);

    // Since we just satisfied the `previous_epoch`, the next epoch to satisfy should be
    // the one following that.
    doppelganger_state.next_check_epoch = previous_epoch.saturating_add(1_u64);
Could make this logic a method in DoppelgangerState
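Something along these lines, perhaps (field names follow the quoted snippet; the method name and surrounding scaffolding are assumptions):

```rust
/// Sketch of the per-validator doppelganger state (field names follow the
/// snippet above; everything else here is illustrative).
struct DoppelgangerState {
    next_check_epoch: u64,
    remaining_epochs: u64,
}

impl DoppelgangerState {
    /// Record that `previous_epoch` has been satisfied: consume one remaining
    /// epoch and schedule the next check for the epoch that follows it.
    fn complete_detection_in_epoch(&mut self, previous_epoch: u64) {
        self.remaining_epochs = self.remaining_epochs.saturating_sub(1);
        self.next_check_epoch = previous_epoch.saturating_add(1);
    }
}

fn main() {
    let mut state = DoppelgangerState { next_check_epoch: 10, remaining_epochs: 2 };
    state.complete_detection_in_epoch(10);
    assert_eq!(state.remaining_epochs, 1);
    assert_eq!(state.next_check_epoch, 11);
}
```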
    // Collect *all* pubkeys for resolving indices, even those undergoing doppelganger protection.
    //
    // Singe doppelganger protection queries rely on validator indices it is important to ensure we
Suggested change:

    - // Singe doppelganger protection queries rely on validator indices it is important to ensure we
    + // Since doppelganger protection queries rely on validator indices it is important to ensure we
@@ -137,8 +139,10 @@ impl<T: SlotClock + 'static, E: EthSpec> DutiesService<T, E> {
    pub fn block_proposers(&self, slot: Slot) -> HashSet<PublicKeyBytes> {
        let epoch = slot.epoch(E::slots_per_epoch());

        // this is the set of local pubkeys that are not currently in a doppelganger detection period
        let signing_pubkeys = self.validator_store.signing_pubkeys_hashset();
        // Only collect validators that are not presently disabled for doppelganger protection.
I think it's a little confusing to figure out what "not disabled for doppelganger protection" means
Suggested change:

    - // Only collect validators that are not presently disabled for doppelganger protection.
    + // Only collect validators that are considered safe in terms of doppelganger protection.
@@ -160,8 +164,10 @@ impl<T: SlotClock + 'static, E: EthSpec> DutiesService<T, E> {
    pub fn attesters(&self, slot: Slot) -> Vec<DutyAndProof> {
        let epoch = slot.epoch(E::slots_per_epoch());

        // this is the set of local pubkeys that are not currently in a doppelganger detection period
        let signing_pubkeys = self.validator_store.signing_pubkeys_hashset();
        // Only collect validators that are not presently disabled for doppelganger protection.
Same
    // A weird side-effect is that the BN will keep getting liveness queries that will be
    // ignored by the VC. Since the VC *should* shutdown anyway, this seems fine.
    if violators_exist {
        doppelganger_state.remaining_epochs = u64::max_value();
According to my IDE, `u64::MAX` should be used instead of `u64::max_value()`.
    // Abort the routine if inconsistency is detected.
    .ok_or_else(|| format!("inconsistent states for validator pubkey {}", pubkey))?;

    // If a single doppelganger is detected, enable doppelganger checks on all
I think if we're able to do this on a validator by validator basis it'd be better
Since we're going to shut down the VC once we detect any doppelganger, aren't we effectively locked into the "all or none" behaviour?
The reason I'm jamming all of them here is so it's more firmly "all or none" not a "some until the VC finishes whatever it's doing".
To be clear, I'm adding this "jamming" logic in anticipation of things like sigp#2316 that might try to stop the VC from shutting down.
My thinking here was related to my thinking below about maybe trying to recover if we can't shut down... but I am now thinking that is too risky. I am understanding the reasoning for this logic a bit more; I hadn't considered the impact of sigp#2316.
    // validators forever (technically only 2**64 epochs).
    //
    // This has the effect of stopping validator activity even if the validator client
    // fails to shut down.
Rather than setting this to `u64::MAX`, could we just increment `next_check_epoch` and reset `remaining_epochs` to `DEFAULT_REMAINING_EPOCHS`? I can't imagine hitting this weird scenario where doppelganger fails, doesn't shut down and the VC runs for 2**64 epochs... but it makes more sense to me that you'd just keep pushing back doppelganger detection if for some reason we detected a doppelganger but didn't shut down.
I can't imagine hitting this weird scenario where doppelganger fails, doesn't shut down and the VC runs for 2**64 epochs... but it makes more sense to me that you'd just keep pushing back doppelganger detection if for some reason we detected a doppelganger but didn't shut down.
By my calculations, 2**64 epochs is 224 trillion years so it's pretty safe. We can't really do any better than this either since `next_check_epoch` is ultimately a `u64`.
If we went with the push-back method, it seems to me like that has a possibility of recovering. E.g., someone turns off the remote doppelganger and then we'll start working again. Is that what you were suggesting?
I was thinking if we can't shut down maybe we should try to recover... but I guess it's still too much of a risk that the doppelganger just missed some attestations.
By my calculations, 2**64 epochs is 224 trillion years so it's pretty safe
About how long I plan to be staking for ^ 🤣
I was thinking if we can't shut down maybe we should try to recover... but I guess it's still too much of a risk that the doppelganger just missed some attestations.
Yeah, I'm tempted to leave recovery alone. I'm not sure how to specify it in a way that is safe. I think it's nice to maintain the promise that "if a doppelganger is detected the VC will stop".
Thanks for the comments! I've added a 👍 to all the changes I agree with. I recall you saying you're happy to implement them, so go ahead! There were a couple I didn't quite understand, so I left some comments there :)
- calculate next epoch in `beacon_node_liveness` method
…an/lighthouse into paul-sean-dopple
validator_client/src/lib.rs
            .clone(),
    );

    validator_store.attach_doppelganger_service(service.clone())?;
Found an issue where we are still proposing during doppelganger detection. Seems like it's because we are cloning the validator store before creating the doppelganger service, so the block service and attestation service never get a reference to the doppelganger service.
Ooft! Great find 😬
I'll knock up a patch for this. I think I'll switch the other services to `Arc<ValidatorStore>` and remove `Clone` for `ValidatorStore`. That type of pattern should help prevent this type of bug from reoccurring.
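Roughly the shape of that pattern, with stub types rather than the real Lighthouse code: every service clones the `Arc`, not the store itself, so a doppelganger service attached later is visible to all of them. (The interior `RwLock` here is only to make the sketch self-contained; the actual wiring in the PR differs, as discussed below.)

```rust
use std::sync::{Arc, RwLock};

/// Stand-in for the doppelganger service.
struct DoppelgangerService;

/// Stand-in for `ValidatorStore`. Note: no `#[derive(Clone)]`, so independent
/// copies can't be created accidentally; everyone shares the same `Arc`.
struct ValidatorStore {
    doppelganger_service: RwLock<Option<Arc<DoppelgangerService>>>,
}

impl ValidatorStore {
    fn attach_doppelganger_service(&self, service: Arc<DoppelgangerService>) {
        *self.doppelganger_service.write().unwrap() = Some(service);
    }

    fn doppelganger_protection_active(&self) -> bool {
        self.doppelganger_service.read().unwrap().is_some()
    }
}

fn main() {
    let store = Arc::new(ValidatorStore { doppelganger_service: RwLock::new(None) });

    // The block/attestation services hold clones of the *Arc*, not of the store.
    let block_service_store = Arc::clone(&store);

    // Attaching the doppelganger service later is visible to every holder.
    store.attach_doppelganger_service(Arc::new(DoppelgangerService));
    assert!(block_service_store.doppelganger_protection_active());
}
```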
I'm happy to make the patch too if you haven't started, just let me know 🙂
The fix over in a35bf31 looks good, thanks for that!
I'm wondering why the `Option` for `DoppelgangerService` was removed from `ValidatorStore` and replaced with an "enabled" `bool`? It seems like it would be safer to retain the `Option` since it forces us to check for "enabled" at the type level. Perhaps I'm missing something 🙂
I removed the `Option` because adding an `Arc` around `ValidatorStore` makes it so we can't attach the `DoppelgangerService` to it after initialization without also adding a `RwLock`. So it was sort of an ergonomics/complexity reduction thing. Let me know if you want me to revert it and add the `RwLock` instead.
I had a go at it and was able to add the `Option` without a `RwLock` here: paulhauner@1506166. I also removed `Clone` from `DoppelgangerService` to avoid a similar issue to the one I created with `ValidatorStore`.

I think it's worth having the `Option` since it effectively forces us to check "enabled".
Let me know what you think.
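To illustrate the type-level argument with stand-in types (not the PR's actual code): an `Option` makes the compiler force callers to handle the "doppelganger protection disabled" case, whereas a separate `bool` flag can be silently forgotten.

```rust
/// Stand-in for the doppelganger service.
struct DoppelgangerService;

impl DoppelgangerService {
    /// Illustrative check: is this validator currently considered safe to sign?
    fn validator_is_safe(&self) -> bool {
        false // during the detection period, assume "not yet safe"
    }
}

/// Stand-in for `ValidatorStore` holding an optional service.
struct ValidatorStore {
    doppelganger_service: Option<DoppelgangerService>,
}

impl ValidatorStore {
    /// Callers *must* consider the `None` case: if no service is attached,
    /// doppelganger protection is disabled and signing is allowed.
    fn doppelganger_protection_allows_signing(&self) -> bool {
        match &self.doppelganger_service {
            Some(service) => service.validator_is_safe(),
            None => true,
        }
    }
}

fn main() {
    let disabled = ValidatorStore { doppelganger_service: None };
    let enabled = ValidatorStore { doppelganger_service: Some(DoppelgangerService) };
    assert!(disabled.doppelganger_protection_allows_signing());
    assert!(!enabled.doppelganger_protection_allows_signing());
}
```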
Nice! I do like that better!
…ul-sean-dopple (Conflicts: Cargo.lock)
- Remove `Option` from `validator_store.doppelganger_detection`
- Rename doppelganger detection to doppelganger protection
- Update logging during doppelganger protection period
…an/lighthouse into paul-sean-dopple (Conflicts: validator_client/src/http_api/mod.rs)
Perhaps we should merge this into sigp#2230 and continue there? What are your thoughts?

Good with me!