Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unhide shut-down-when-validator-slashed-enabled option #8011

Merged
merged 4 commits into from
Mar 12, 2024

Conversation

mehdi-aouadi
Copy link
Contributor

@mehdi-aouadi mehdi-aouadi commented Feb 24, 2024

PR Description

Unhide the shut-down-when-validator-slashed-enabled option.
The validator slashing detection feature has been tested on goerli. The slashed validators were detected successfully in few seconds and Teku shut down as expected in both scenarios:

  1. Single process VC/BN, the whole Teku node shut down
  2. Separate VC and BN, the VC running the slashed validator shut down as expected

A documentation PR is open and will be merged once this PR has been merged.
The link to the new documentation section in the changelog will be working once the new documentation is released (the current link is pointing to the latest version)

Fixed Issue(s)

fixes #7707

Documentation

  • I thought about documentation and added the doc-change-required label to this PR if updates are required.

Changelog

  • I thought about adding a changelog entry, and added one if I deemed necessary.

@mehdi-aouadi mehdi-aouadi self-assigned this Feb 24, 2024
@@ -154,13 +154,12 @@ public class ValidatorOptions {
DEFAULT_VALIDATOR_IS_LOCAL_SLASHING_PROTECTION_SYNCHRONIZED_ENABLED;

@Option(
names = {"--Xshut-down-when-validator-slashed-enabled"},
names = {"--shut-down-when-validator-slashed-enabled"},
paramLabel = "<BOOLEAN>",
description =
"If an owned validator key is detected as slashed, the node should terminate with exit code 2. In this case, the service should not be restarted.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I may change to "If enabled and an owned validator key is detected as slashed, the node will terminate. In this case, the service should not be restarted." I think mentioning exit code 2 doesn't mean much to most of the users.

Copy link
Contributor

@zilm13 zilm13 Feb 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But how do you track when "the service should not be restarted" and when it's ok to restart. May be "In this case, the service should not be restarted, check documentation for more details."?

Copy link
Contributor Author

@mehdi-aouadi mehdi-aouadi Feb 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the exit code mention has been added to the description after this previous comment.
I agree that it looks like an implementation detail and could be left in the doc and removed from the option description (the exit code is already mentioned and the detailed behavior explained in the doc).
What do you think @rolfyone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the exit code mention from the description in the end. It looks like an implementation detail that could be left to the detailed documentation.
@zilm13 I didn't add check documentation for more details since we could assume that this is a short description of the option and it goes without saying (we didn't do that for the doppelganger detection for example). I assume that anyone who's interested would check the option in the documentation first, since it's not hidden anymore

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so i think this is an interesting one, sorry for taking so long.
IMO our responsibility is to shut the service down. We should be very clear about how we want this to work but this was what i was thinking....

My initial thoughts were:

  • Startup like normal, no crazy checks or complications (this makes us compatible)
  • If we detect one of our keys were slashed while we're in flight, we exit code 2. THE MAINTAINER would be responsible for ensuring their service does not restart after a code 2. This is something they should take ownership of and we can be clear about documenting that. - We might be able to document how this is done with systemctl potentially or link out to how it looks or something...
  • After that, if you startup and theres a slashed key, it's not our problem, and if you get slashed again, then that's because the service wasn't configured IMO.

One option would be:
Given it's actually already written, I do wonder if we offer an option something like '--abort-startup-when-validator-slashed-enabled` or something (i hate that name, maybe theres a better one) then if someone really does want to never startup they'd have the option.... I'd be interested what people think about that small change to this...

so then if that new flag happened, you could do the current behaviour if thats what you want, then you'd not need to change service files (but changing the service file would be a LOT less noisy)
It'd be super important under this version to have your service file changed so that it doesn't just keep slashing you every startup.


Ultimately i don't feel super strongly about going this way, as long as we are only applying this 'abort on startup' in the case the flag is defined. I can see the argument that this way has a better UX, because if you have a slashed key you'll at least not be able to start performing duties until it's moved. On the other hand, if it's a slashed key in a folder of 20k, i'd want to know exactly the key name to be removing, because it's not like I can do a ls to find it...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context, with the current implementation, there are two ways of detecting a slashed validator:

  1. attester_slashing or proposer_slashing gossip messages
  2. active_slashed or exited_slashed validator status from the state (this happens when the slashing event gets included in a block. This is useful when we don't get the gossip message for some reason, probably network issues)

What happens currently is that usually we detect the slashing event quite rapidly from the gossip first and we shut down. Meanwhile, the slashing event gets included in a block and the validator status gets updated to active_slashed in the state. Whenever we restart after the first shut down, we see that the validator status is active_slashed and hence we shut down. This will be the case until we remove the slashed key.

With this in mind, the proposed --abort-startup-when-validator-slashed-enabled doesn't really add anything since with the current --shut-down-when-validator-slashed-enabled we won't startup because we check the validator statuses at each startup and when we will see the active_slashed one, we will shut down.
In order to restart the node, the feature must be disabled or the slashed key removed. This could even be considered a good thing to prevent a risky restart.

IMO, if a user is using this feature, it means they are willing to do something when a validator gets slashed (be it removing the slashed key on disabling the feature).

It could make sense to add a complimentary whitelisting option as suggested by @tbenr, something like --ignore-slashed-keys=0x00000000000000,0x111111111111111 if for some reason the user would like to keep the feature enabled for the rest of the validator and the slashed ones running anyway.

Otherwise, I'll add this link to the systemd restart option documentation to our doc. In our case, we'd like it ot be set as the following:
Restart=on-success
It will make it restart only when the process exit with a clean exit code, defined as:


> exit code of 0;
> for types other than Type=oneshot, one of the signals SIGHUP, SIGINT, SIGTERM, or SIGPIPE;
> exit statuses and signals specified in SuccessExitStatus=.

TLDR
My opinion is: we either add --ignore-slashed-keys or we do nothing

Copy link
Contributor

@zilm13 zilm13 Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we want an option for restart without removing the key from owned, it should be not just flag, it should be list of the public keys, it's the safest option.

Another issue, that Paul highlighted, if it's 5 keys of 20,000 for example, we need all this keys printed, but if they are slashed during one epoch not in one slot it could be an issue, though we are already doing our best printing not only one key (but we will stop on the first slot with slashed key detected).

Until we get some feedback about this feature from users we could pause trying to make it perfect and leave it as is or also add skip-list of slashed keys option. Both would be good to my mind.

Copy link
Contributor

@zilm13 zilm13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mehdi-aouadi mehdi-aouadi enabled auto-merge (squash) March 12, 2024 10:14
@mehdi-aouadi mehdi-aouadi merged commit 87fa8ca into Consensys:master Mar 12, 2024
15 checks passed
@mehdi-aouadi mehdi-aouadi deleted the validator-slashing branch March 12, 2024 10:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validator slashing protection
4 participants