-
Notifications
You must be signed in to change notification settings - Fork 300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
unhide shut-down-when-validator-slashed-enabled option #8011
unhide shut-down-when-validator-slashed-enabled option #8011
Conversation
@@ -154,13 +154,12 @@ public class ValidatorOptions { | |||
DEFAULT_VALIDATOR_IS_LOCAL_SLASHING_PROTECTION_SYNCHRONIZED_ENABLED; | |||
|
|||
@Option( | |||
names = {"--Xshut-down-when-validator-slashed-enabled"}, | |||
names = {"--shut-down-when-validator-slashed-enabled"}, | |||
paramLabel = "<BOOLEAN>", | |||
description = | |||
"If an owned validator key is detected as slashed, the node should terminate with exit code 2. In this case, the service should not be restarted.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I may change to "If enabled and an owned validator key is detected as slashed, the node will terminate. In this case, the service should not be restarted." I think mentioning exit code 2 doesn't mean much to most of the users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But how do you track when "the service should not be restarted" and when it's ok to restart. May be "In this case, the service should not be restarted, check documentation for more details."?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the exit code mention has been added to the description after this previous comment.
I agree that it looks like an implementation detail and could be left in the doc and removed from the option description (the exit code is already mentioned and the detailed behavior explained in the doc).
What do you think @rolfyone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the exit code mention from the description in the end. It looks like an implementation detail that could be left to the detailed documentation.
@zilm13 I didn't add check documentation for more details
since we could assume that this is a short description of the option and it goes without saying (we didn't do that for the doppelganger detection for example). I assume that anyone who's interested would check the option in the documentation first, since it's not hidden anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so i think this is an interesting one, sorry for taking so long.
IMO our responsibility is to shut the service down. We should be very clear about how we want this to work but this was what i was thinking....
My initial thoughts were:
- Startup like normal, no crazy checks or complications (this makes us compatible)
- If we detect one of our keys were slashed while we're in flight, we exit code 2. THE MAINTAINER would be responsible for ensuring their service does not restart after a code 2. This is something they should take ownership of and we can be clear about documenting that. - We might be able to document how this is done with systemctl potentially or link out to how it looks or something...
- After that, if you startup and theres a slashed key, it's not our problem, and if you get slashed again, then that's because the service wasn't configured IMO.
One option would be:
Given it's actually already written, I do wonder if we offer an option something like '--abort-startup-when-validator-slashed-enabled` or something (i hate that name, maybe theres a better one) then if someone really does want to never startup they'd have the option.... I'd be interested what people think about that small change to this...
so then if that new flag happened, you could do the current behaviour if thats what you want, then you'd not need to change service files (but changing the service file would be a LOT less noisy)
It'd be super important under this version to have your service file changed so that it doesn't just keep slashing you every startup.
Ultimately i don't feel super strongly about going this way, as long as we are only applying this 'abort on startup' in the case the flag is defined. I can see the argument that this way has a better UX, because if you have a slashed key you'll at least not be able to start performing duties until it's moved. On the other hand, if it's a slashed key in a folder of 20k, i'd want to know exactly the key name to be removing, because it's not like I can do a ls
to find it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For context, with the current implementation, there are two ways of detecting a slashed validator:
attester_slashing
orproposer_slashing
gossip messagesactive_slashed
orexited_slashed
validator status from the state (this happens when the slashing event gets included in a block. This is useful when we don't get the gossip message for some reason, probably network issues)
What happens currently is that usually we detect the slashing event quite rapidly from the gossip first and we shut down. Meanwhile, the slashing event gets included in a block and the validator status gets updated to active_slashed
in the state. Whenever we restart after the first shut down, we see that the validator status is active_slashed
and hence we shut down. This will be the case until we remove the slashed key.
With this in mind, the proposed --abort-startup-when-validator-slashed-enabled
doesn't really add anything since with the current --shut-down-when-validator-slashed-enabled
we won't startup because we check the validator statuses at each startup and when we will see the active_slashed
one, we will shut down.
In order to restart the node, the feature must be disabled or the slashed key removed. This could even be considered a good thing to prevent a risky restart.
IMO, if a user is using this feature, it means they are willing to do something when a validator gets slashed (be it removing the slashed key on disabling the feature).
It could make sense to add a complimentary whitelisting option as suggested by @tbenr, something like --ignore-slashed-keys=0x00000000000000,0x111111111111111
if for some reason the user would like to keep the feature enabled for the rest of the validator and the slashed ones running anyway.
Otherwise, I'll add this link to the systemd
restart option documentation to our doc. In our case, we'd like it ot be set as the following:
Restart=on-success
It will make it restart only when the process exit with a clean exit code
, defined as:
> exit code of 0;
> for types other than Type=oneshot, one of the signals SIGHUP, SIGINT, SIGTERM, or SIGPIPE;
> exit statuses and signals specified in SuccessExitStatus=.
TLDR
My opinion is: we either add --ignore-slashed-keys
or we do nothing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we want an option for restart without removing the key from owned, it should be not just flag, it should be list of the public keys, it's the safest option.
Another issue, that Paul highlighted, if it's 5 keys of 20,000 for example, we need all this keys printed, but if they are slashed during one epoch not in one slot it could be an issue, though we are already doing our best printing not only one key (but we will stop on the first slot with slashed key detected).
Until we get some feedback about this feature from users we could pause trying to make it perfect and leave it as is or also add skip-list of slashed keys option. Both would be good to my mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
4bd39e1
to
3a936b0
Compare
PR Description
Unhide the
shut-down-when-validator-slashed-enabled
option.The validator slashing detection feature has been tested on goerli. The slashed validators were detected successfully in few seconds and Teku shut down as expected in both scenarios:
A documentation PR is open and will be merged once this PR has been merged.
The link to the new documentation section in the changelog will be working once the new documentation is released (the current link is pointing to the latest version)
Fixed Issue(s)
fixes #7707
Documentation
doc-change-required
label to this PR if updates are required.Changelog