Engine API: add validAncestorHash to engine_executePayload #84
Conversation
Looks good to me! Made a couple of suggestions
cc @karalabe PTAL?
Force-pushed from 2524d97 to 542948b
A couple of cases,
But it may be difficult to pinpoint the valid ancestor, as opposed to a 'current valid known block/height' (which may or may not be an ancestor).
Due to the requirement that CL optimistically process beacon blocks when EL returns `SYNCING`, the impetus could then be on either CL or EL to figure things out. The proposed solution has EL try to figure it out by essentially saying how deep the invalid branch in question goes. The primary problem I now see is if CL is asking about EL blocks that are entirely unavailable (e.g. due to it being an invalid chain of >30k depth that has been pruned). This would leave the EL unable to respond with anything other than `SYNCING`. Your cases:
iiuc, the only time the latest valid ancestor would be inaccessible is if this "bad" chain was pruned and EL could no longer find it when CL tries to insert a payload. So the questions are:
Or do we make the assumption that CL will, in almost all realistic cases, not be led down a simultaneously invalid and unavailable EL chain when syncing, because this would require a 1/3 slashable actor? And if it did happen, it would only result in a small set of nodes being tricked and put into a bad sync state.
Okay, based on our convo today, we are going to keep the functionality as specified here -- return the most recent valid ancestor hash. There are still some cases where EL might never be able to resolve the validity of the branch in question (e.g. if it has been pruned). How to surface such a sync failure is not in the scope of this PR and generally not in the scope of the core of the engine API.
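For orientation only, here is a minimal Go sketch of how a CL could act on that returned hash: on `INVALID`, everything between the payload and the latest valid ancestor gets invalidated. The types and field names are hypothetical, not part of the engine API.

```go
package main

import "fmt"

// Hypothetical in-memory block index on the CL side: child hash -> parent hash,
// plus a validity status per block. Illustrative only.
type chainIndex struct {
	parent map[string]string
	status map[string]string // "VALID", "INVALID", or "SYNCING"
}

// invalidateBranch marks every block from payloadHash back to (but not including)
// latestValidHash as INVALID -- a single INVALID response can thus invalidate an
// arbitrary number of blocks starting with the payload.
func (c *chainIndex) invalidateBranch(payloadHash, latestValidHash string) {
	for h := payloadHash; h != "" && h != latestValidHash; h = c.parent[h] {
		c.status[h] = "INVALID"
	}
}

func main() {
	c := &chainIndex{
		parent: map[string]string{"0xc": "0xb", "0xb": "0xa", "0xa": "0x9"},
		status: map[string]string{},
	}
	// EL responded {status: INVALID, latestValidHash: 0x9} for payload 0xc.
	c.invalidateBranch("0xc", "0x9")
	fmt.Println(c.status) // map[0xa:INVALID 0xb:INVALID 0xc:INVALID]
}
```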
src/engine/specification.md (Outdated)
4. In the case when the parent block is unknown, client software **MUST** pull the block from the network and take one of the following actions depending on the parent block properties:
   - If the parent block is a PoW block as per the [EIP-3675](https://eips.ethereum.org/EIPS/eip-3675#specification) definition, then all missing dependencies of the payload **MUST** be pulled from the network and validated accordingly. The call **MUST** be responded to according to the validity of the payload and the chain of its ancestors.
   - If the parent block is a PoS block as per the [EIP-3675](https://eips.ethereum.org/EIPS/eip-3675#specification) definition, then the call **MAY** be responded to with `SYNCING` status and the sync process **SHOULD** be initiated accordingly.
3. Client software **MUST** return `{status: SYNCING, latestValidHash: None}` if the client software does not have the requisite data available locally to validate the payload and cannot retrieve this required data in less than `SECONDS_PER_SLOT / 30` (0.4s in the Mainnet configuration), or if the sync process is already in progress. In the event that requisite data to validate the payload is missing (e.g. the client does not have the payload identified by `parentHash`), the client software **SHOULD** initiate the sync process.
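For orientation only (not part of the spec text), a rough Go sketch of the parent-block branching in the quoted excerpt: a PoW parent forces dependency retrieval and a validity-based answer, while a PoS parent may yield `SYNCING`. All identifiers are hypothetical.

```go
package main

import "fmt"

// parentKind is an illustrative stand-in for the EIP-3675 block classification.
type parentKind int

const (
	parentPoW parentKind = iota // parent is a PoW block per EIP-3675
	parentPoS                   // parent is a PoS block per EIP-3675
)

// handleUnknownParent sketches the branching above once the unknown parent has
// been pulled from the network and classified.
func handleUnknownParent(kind parentKind, ancestorsValid bool, startSync func()) string {
	switch kind {
	case parentPoW:
		// PoW parent: missing dependencies are pulled and validated, and the
		// response reflects the validity of the payload and its ancestors.
		if ancestorsValid {
			return "VALID"
		}
		return "INVALID"
	default:
		// PoS parent: the call may be answered with SYNCING while sync is initiated.
		startSync()
		return "SYNCING"
	}
}

func main() {
	fmt.Println(handleUnknownParent(parentPoS, false, func() {})) // SYNCING
}
```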
What is the rationale for the timeout? Does the timer start ticking after receiving the call, and does it include the corresponding disk/cache access that tries to pull up the parent's state? If we want to give EL some time to go to the wire and pull e.g. a parent block, would this timeout cover the execution of the parent block or just pulling it from the wire?
I think the timeout only includes non-local retrieval. If you have the info locally (in a disk or memory cache) you should not be responding with `SYNCING`, because you are in fact synced.
The idea is that we want to give some guidance on when EL should decide, "I don't have the info locally required and don't have enough time to quickly sync to respond to the call"
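A minimal sketch of that reading, assuming the 0.4s figure from the quoted text as the retrieval budget; the function and parameter names are made up for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// retrievalBudget is the assumed 0.4s window (SECONDS_PER_SLOT / 30 on mainnet).
const retrievalBudget = 400 * time.Millisecond

// decideStatus sketches the interpretation above: data already held locally is
// validated immediately (never SYNCING); only non-local retrieval is bounded by
// the budget before falling back to SYNCING.
func decideStatus(haveLocally bool, fetch func(ctx context.Context) bool, validate func() bool) string {
	if haveLocally {
		if validate() {
			return "VALID"
		}
		return "INVALID"
	}
	ctx, cancel := context.WithTimeout(context.Background(), retrievalBudget)
	defer cancel()
	if fetch(ctx) {
		if validate() {
			return "VALID"
		}
		return "INVALID"
	}
	return "SYNCING" // could not retrieve in time; sync continues in the background
}

func main() {
	fmt.Println(decideStatus(true, nil, func() bool { return true })) // VALID
}
```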
So, the timer starts ticking when the EL client initiates the sync process and ends when the missing block is received and executed, i.e. the timeout applies to the sync process. I think the timeout mechanism should be specified in more detail to avoid bugs due to misreading the spec.
I am thinking about cases where this timeout is beneficial for the node. Here is what comes to mind:
- Temporary outage of EL software leaving EL's head behind CL's. If the diff is only one EL block, the validator will likely not miss its attestation opportunity, as the dependency will be resolved without EL turning into `SYNCING`; if it's more than a block, it will likely turn into `SYNCING`.
IMO, this mechanism could be even more beneficial if it targeted the logical beacon chain clock instead of just absolute time, i.e. "if the initiated sync process isn't finished by `TIMESTAMP + (SECONDS_PER_SLOT / 3 - 1)` then EL must respond with `SYNCING`". In `SECONDS_PER_SLOT / 3 - 1`, the `1` is a placeholder and should be the time normally needed to execute the payload, respond back to CL, and for CL to update its fork choice state. The target here is to have the beacon block accepted before the attestation boundary.
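A tiny sketch of that logical-clock variant in Go, assuming mainnet's 12-second slots and keeping the 1-second margin as the placeholder described above:

```go
package main

import (
	"fmt"
	"time"
)

const (
	secondsPerSlot   = 12 * time.Second
	attestationThird = secondsPerSlot / 3 // 4s attestation boundary on mainnet
	margin           = 1 * time.Second    // placeholder: execute payload + respond + CL fork-choice update
)

// syncDeadline returns the wall-clock time by which an initiated sync must
// finish before the EL should answer SYNCING:
// slot start (block TIMESTAMP) + SECONDS_PER_SLOT/3 - margin.
func syncDeadline(slotStart time.Time) time.Time {
	return slotStart.Add(attestationThird - margin)
}

func main() {
	slotStart := time.Now().Truncate(secondsPerSlot) // illustrative slot boundary
	fmt.Println("respond SYNCING if sync has not finished by:", syncDeadline(slotStart))
}
```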
Though, any of these approaches looks complicated and bug prone. Taking into account how infrequently such cases appear, do we really want this mechanism to be implemented? And if we do, the next question is whether the spec is a good place for it, or whether it could be left to implementations. My current impression is that this is premature optimisation, and when the need for this optimisation becomes apparent, it could be handled by implementers without requiring a change to the spec.
I agree with pulling QoS out of this PR. I'll pull it out into a separate issue for discussion
removed here b6ded72
and opened a relevant issue here -- #91
Force-pushed from 5e5495f to fa01a68
Force-pushed from b6ded72 to 741d0a6
LGTM!
NOTE: Currently based on #83 and pointed to that PR as the base. Trying to keep the diff legible while continuing to be able to work. Will rebase and point this PR to `main` once #83 is merged.

`INVALID` is insufficient to parse the validity of ancestor blocks that previously returned `SYNCING`. As per discussions at interop, add `validAncestorHash` to the return value to point to the latest (by block number) valid ancestor of the payload that is being processed. In the event of `INVALID`, this can thus invalidate an arbitrary number of blocks in the chain starting with the payload.

Did discuss this with @holiman to ensure that this is feasible in geth.
Additionally:
- Add guidance on when to return `SYNCING` -- `SECONDS_PER_SLOT / 30` (0.4s).
- Allow the `SYNCING` return value when requisite data is missing. Enforcing finding of dependencies of blocks with PoW ancestors rather than returning `SYNCING` could result in deadlocks.
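As a purely illustrative sketch of the response shapes being discussed: the field name below follows the `latestValidHash` spelling used in the quoted spec text (the PR title calls it `validAncestorHash`), and the hash value is a dummy; this is not an authoritative schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// executePayloadResponse is an illustrative shape of the engine_executePayload
// result object discussed in this PR.
type executePayloadResponse struct {
	Status          string  `json:"status"`                    // VALID | INVALID | SYNCING
	LatestValidHash *string `json:"latestValidHash,omitempty"` // omitted/None while SYNCING
}

func main() {
	ancestor := "0x01" // dummy hash of the latest (by block number) valid ancestor
	for _, r := range []executePayloadResponse{
		{Status: "INVALID", LatestValidHash: &ancestor}, // invalidates the payload and everything back to 0x01
		{Status: "SYNCING"},                             // requisite data not available yet
	} {
		b, _ := json.Marshal(r)
		fmt.Println(string(b))
	}
}
```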