
optimistic sync #3793

Merged: 12 commits merged into unstable from R9I, Jul 4, 2022
Conversation

tersec
Contributor

@tersec tersec commented Jun 23, 2022

@tersec tersec marked this pull request as draft June 23, 2022 12:38
@tersec tersec force-pushed the R9I branch 2 times, most recently from cda098b to d0972b6 Compare June 23, 2022 19:30
@tersec tersec marked this pull request as ready for review June 23, 2022 19:57
@tersec tersec mentioned this pull request Jun 23, 2022
@github-actions

github-actions bot commented Jun 24, 2022

Unit Test Results

     12 files ±0   857 suites ±0   1h 0m 48s ⏱️ −2m 22s
1 712 tests ±0: 1 660 passed ±0, 52 skipped ±0, 0 failed ±0
9 944 runs  ±0: 9 816 passed ±0, 128 skipped ±0, 0 failed ±0

Results for commit cf1f410. Comparison against base commit 9eb1a3e.

♻️ This comment has been updated with latest results.

@tersec tersec changed the title from [WIP] optimistic sync to optimistic sync Jun 25, 2022
@tersec tersec force-pushed the R9I branch 5 times, most recently from 57d0ad2 to 68a31e3 Compare June 26, 2022 00:06
Contributor

@etan-status etan-status left a comment

Still a few open TODO and when true that need to be fixed or logged as issues to avoid losing them.

Overall quite impressive how compact this change is implementation-wise.

beacon_chain/consensus_object_pools/block_clearance.nim (outdated; resolved)
beacon_chain/consensus_object_pools/block_clearance.nim (outdated; resolved)
beacon_chain/consensus_object_pools/blockchain_dag.nim (outdated; resolved)
beacon_chain/consensus_object_pools/blockchain_dag.nim (outdated; resolved)
Comment on lines 1417 to 1418
warn "markPayloadInvalid: attempt to invalidate valid block",
blck = shortLog(blck.get)
Contributor

During initialization, only finalizedHead and head are populated into dag.optimisticRoots. Could it happen that this function is called on one of the intermediate blocks? Is this a condition that is expected in edge cases, or is this a programming error? If the latter, a doAssert verifyFinalization notin dag.updateFlags could be useful to detect this being hit.

Contributor Author

@tersec tersec Jun 28, 2022

https://github.com/ethereum/consensus-specs/blob/dev/sync/optimistic.md#transitioning-from-valid---invalidated-or-invalidated---valid describes, as far as this warning's topic:

These operations are purposefully omitted. It is outside of the scope of the specification since it's only possible with a faulty EE.

Such a scenario requires manual intervention.

It's possible that it'd be better to be even more emphatic than a warning, because this is a failure on par with doppelganger detection. Something went very wrong. That said, while it's not necessarily a Nimbus programming error, it is a programming error creating potential consensus failure, and verifyFinalization mode making this more obvious would be useful.

Contributor Author

3c864f0 adds doAssert verifyFinalization notin dag.updateFlags

beacon_chain/validators/validator_duties.nim (outdated; resolved)
beacon_chain/validators/validator_duties.nim (outdated; resolved)
tests/test_block_processor.nim (outdated; resolved)
tests/test_block_processor.nim (outdated; resolved)
tests/test_block_processor.nim (outdated; resolved)
@tersec tersec force-pushed the R9I branch 3 times, most recently from 8100623 to e56c9d1 Compare June 30, 2022 08:54
@tersec
Contributor Author

tersec commented Jun 30, 2022

> Still a few open TODO and when true that need to be fixed or logged as issues to avoid losing them.
>
> Overall quite impressive how compact this change is implementation-wise.

There's only one when true, and that it shows up in this PR is an artifact of refactoring the block processing loop. It was already there, and this PR isn't going to fix it. It's related to TTD detection functionality.

This PR now removes significantly more TODOs than it adds:

$ git diff -U0 f08f9e2bd7e1ba9fe55b02272189134871769031 | grep TODO
+  # TODO rename this to get_optimistic_head
+    # TODO have a third version which is fire-and-forget for when it is merge
+         # TODO self.consensusManager.eth1Monitor.terminalBlockHash.isSome
+         # TODO detect have-TTD-but-not-is_execution_block case, and where
-          # TODO self.consensusManager.eth1Monitor.terminalBlockHash.isSome
-    # TODO (cheatfate): Proper implementation required
-    # TODO (cheatfate): Proper implementation required
-    # TODO (cheatfate): Proper implementation required
-      # TODO (cheatfate): Proper implementation required
-          # TODO (cheatfate): Proper implementation required
-      # TODO (cheatfate): Proper implementation required.
-      # TODO (cheatfate): Proper implementation required.
-    # TODO (cheatfate): Proper implementation required.
-    # TODO (cheatfate): Proper implementation required
-    # TODO (cheatfate): Proper implementation required

The rest are pre-existing and/or out of scope/deliberately not part of this PR.

@tersec
Contributor Author

tersec commented Jul 1, 2022

Tested successfully with syncs from scratch on Ropsten with each of Erigon, Besu, Geth, and Nethermind.

@@ -80,24 +80,124 @@ proc expectBlock*(self: var ConsensusManager, expectedSlot: Slot): Future[bool]

return fut

from eth/async_utils import awaitWithTimeout
Contributor

I recently had a discussion with a D programmer who promoted this style of function-local imports. I guess your rationale is the same - to make it easy to copy/paste functions to a different module by keeping their list of imports attached to the function itself.

One potential risk of this approach is that it also makes it much easier to accidentally introduce an import cycle while moving the function which will result in the typical hard-to-decipher error messages brought by cyclic imports.

Contributor Author

Sure, it helps for that purpose.

Other advantages:

  • Nim 1.2's unused-import detection is essentially broken, and Nim 1.6's is only somewhat better. As a result, import lists simply accumulate: there's a 1,000-line module with a pile of imports at the top and no hint of what they provide or why. Enough modules have import-time side effects that the unused-import warning gets disabled, too. Local imports make it clearer that, for modules imported for their symbols rather than their side effects (such as in tests), if no imported symbol is used, the import isn't necessary. It's not 100% automated, but it's better than just letting imports accumulate, and in theory it gives Nim more information for detecting unused imports. The more local and specific the imports, both in the symbols they name and where they're used, the better for this purpose.

It's related to one of the reasons I tend to unexport symbols when I notice them: it's not because of some OOP-thing about data hiding, it's about allowing Nim to detect dead code and, when reading or refactoring code, being confident that (unless there's an export statement with said symbol) changes in that code will only directly affect that one module, and that one doesn't have to scan the entire project. The deeper in the libraries one is working, the more this applies.

  • Various surprising overloads can bubble up from imported modules and cause surprising errors of their own. For example,

withState(someState):
  template f(): untyped = state.foo
  f()

worked right until some irrelevant-to-purpose symbol from deep inside Chronos's HTTP server was imported transitively, which produced a confusing error message and required writing template f(): untyped = state().foo instead. Easy fix, but it took a while to find the exact culprit.
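As a minimal sketch of the function-local import style under discussion (module and proc names here are illustrative, not taken from this PR): Nim imports are top-level, but placing one directly above the single proc that uses it keeps the dependency attached to the code that needs it.

```nim
# Sketch only; illustrative names, not project code.
# The import sits immediately before the one proc that uses it,
# so moving the proc elsewhere carries its dependency along visibly.
from std/strutils import toHex

proc describeAddress(x: uint64): string =
  # toHex is the only symbol this module takes from strutils
  "0x" & toHex(x)
```

If the proc later moves to another module, the from-import line moves with it, and no stale entry is left in a module-header import list.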

Contributor Author

If this interferes with Nim's cyclic import detection, IMO that's a Nim issue.

Contributor

Cyclic imports are legal in Nim, they just produce hard to diagnose error messages at times.
While I have no strong objections against this style, my comment was meant to serve as a warning that one has to be careful with the notion that individual functions are now easy to move between modules if their imports are structured in this way.

Member

> Cyclic imports are legal in Nim, they just produce hard to diagnose error messages at times.

they also produce invalid code with overly broad arguments, where the exact function that actually matches ends up depending on frivolous things like import order - there was no end to ssz import issues until all cyclical imports were removed - this is one more nim feature that sounds good on paper but is worthless practically because it's broken and untested.

Contributor

I was not advocating that they should be used. I was explaining that there is no "cyclic import detection" as the message above implied (it would be relatively easy to add to Nim though).

Otherwise, I have advocated many times that cyclic imports should be supported better in the language because their current state leads to the proliferation of the "types module" which makes encapsulation difficult to achieve in Nim:

https://forum.nim-lang.org/t/4293
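The cycle being discussed, and the "types module" workaround, can be sketched like this (file and type names are illustrative only):

```nim
# a.nim -- illustrative sketch, not project code
import b                  # a depends on b ...
type Block* = object
  id*: int

# b.nim
import a                  # ... and b depends back on a: legal in Nim,
                          # but diagnostics and overload resolution
                          # around the cycle can depend on import order.
proc describe*(blk: a.Block): string = $blk.id

# The common workaround: move Block into a third module, e.g.
# types.nim, which both a and b import -- the "types module" pattern
# that makes encapsulation harder, as noted above.
```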

@zah zah merged commit 1221bb6 into unstable Jul 4, 2022
@zah zah deleted the R9I branch July 4, 2022 20:35
@tersec tersec mentioned this pull request Jul 5, 2022
signedBlock: ForkySignedBeaconBlock, payloadValid: bool,
queueTick: Moment = Moment.now(),
validationDur = Duration()):
Future[Result[BlockRef, BlockError]] {.async.} =
Member

this design in which storeBlock becomes async is a bit unfortunate - ie the processes that call it are not necessarily well prepared to lose their current in-memory state for long periods of time, especially not while waiting for a (potentially slow) EL response (since they're part of validator duties)

Contributor Author

@tersec tersec Jul 5, 2022

It could do something like traceAsyncErrors, or set up yet another queue, but there's fundamentally an asynchronous aspect to this, and it can't avoid changing the storeBlock contract in some way -- either it can return without any particular guarantee of "full" success (even detecting blatantly invalid blocks), or callers need to wait. One minimal change is to always have await self.consensusManager.updateHeadWithExecution(wallTime.slotOrZero) there be either a queue or (implicitly a queue) traceAsyncErrors, which would detach the two and allow storeBlock itself to be synchronous again.

Another point I played with was within the updateHead functions -- the serial dependence here is forkchoiceUpdated (intrinsically a long-lasting call) having to complete before one can, in general, tell the DAG about blocks being optimistic or verified. It's possible to just fire off the fcU, then immediately call dag.updateHead, but that's generally not great. One could arrange a callback or similar, but that's the kind of pattern async/await more-or-less linearizes control-flow-wise. So probably within updateHeadWithExecution, the async is warranted.

Contributor Author

@tersec tersec Jul 5, 2022

A general theme here is how much of the serial dependency is real vs how much is an artifact of a specific way of writing these functions (with async, with callbacks, with queues, with anything), but, in principle, to do the tasks they need to do to participate in the chain, how intrinsic vs how incidental are these delays.

One approach is to care even less about validity than it does now, and just ... not wait. Treat it as fire-and-semi-forget. Just import it if it's a candidate for optimistic sync, move on, etc. There might be some weird display quirks (e.g., the status bar showing wall slot != wall head briefly each slot even when everything goes well, or repeatedly showing opt sync briefly if it tries to be truly accurate for fractions of a second), but more important is that the intrinsic dependencies exist: the aggregated attestations depend on the unaggregated attestations, which depend on the correct head being known.

The challenge here is that https://github.com/ethereum/consensus-specs/blob/dev/sync/optimistic.md#validator-assignments notes:

Validator assignments

An optimistic node is not a full node. It is unable to produce blocks, since an execution engine cannot produce a payload upon an unknown parent. It cannot faithfully attest to the head block of the chain, since it has not fully verified that block.

Block Production

An optimistic validator MUST NOT produce a block (i.e., sign across the DOMAIN_BEACON_PROPOSER domain).

Attesting

An optimistic validator MUST NOT participate in attestation (i.e., sign across the DOMAIN_BEACON_ATTESTER, DOMAIN_SELECTION_PROOF or DOMAIN_AGGREGATE_AND_PROOF domains).

Participating in Sync Committees

An optimistic validator MUST NOT participate in sync committees (i.e., sign across the DOMAIN_SYNC_COMMITTEE, DOMAIN_SYNC_COMMITTEE_SELECTION_PROOF or DOMAIN_CONTRIBUTION_AND_PROOF domains).

To comply with the spec, this serialization enforced by await is required: if one wants to attest, and attest as soon as possible modulo the make-sure-everyone-gets-the-block-first delay, the block must be actually verified, not only optimistic.

So for a non-validating node, the literal stakes are lower, and for a validating node, there's ultimately no choice but to wait. At least this way, if the node has other things to do, it can do them in the meantime.
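The wait the spec forces could be sketched as follows. This is a sketch of the constraint only, not the PR's actual code: elClient, newPayload, PayloadStatus, and performDuties are hypothetical names.

```nim
# Sketch only: elClient, newPayload, PayloadStatus and performDuties
# are hypothetical names used for illustration.
proc storeBlockThenAttest(node: BeaconNode,
                          blck: ForkySignedBeaconBlock) {.async.} =
  # An optimistic validator MUST NOT sign across the proposer,
  # attester, or sync-committee domains, so duties have to wait for
  # the execution layer's verdict, not merely for optimistic import.
  let status = await node.elClient.newPayload(blck)  # potentially slow
  if status == PayloadStatus.valid:
    # Only now is attesting spec-compliant.
    node.performDuties(blck.slot)
```

The await is the point: for a validating node the serialization is unavoidable, while a non-validating node could in principle import optimistically and move on.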

Contributor Author

#3842 might address this. It needs testing, but should be a viable approach. It keeps enough of these invariants that it should remain valid, but be less invasive.

var
optForkchoiceHeadSlot = GENESIS_SLOT # safe default
optForkchoiceHeadRoot: Eth2Digest
from ../consensus_object_pools/blockchain_dag import
Member

funny because in refactoring, I've been going in the opposite direction of exporting stuff much more consistently to avoid long import lists - this helps keep symbols that should be visible, visible - the template problem with the wrong symbol being matched is annoying but can often be solved by turning the template into a proc or generally giving things more descriptive names.. it's a mess, but with exports, it's actually easier to stay away from redundant imports (because there's simply fewer of them that can go stale)

Contributor Author

It also creates environments with vastly more symbols imported than they need, and no automatic way for Nim to flag this, because those imports are indeed necessary; it's just that only a small part of each is used. When isolating reproducible test cases for Nim issue filing I run across this repeatedly -- some module needs three symbols from some relatively small modules, but for some slight convenience, it does

import large_umbrella_module

which in turn is something like

import a, b, c, d, e, f
export a, b, c, d, e, f

and then some of a, b, etc continue this fractally. It works, yes, but it also inhibits analysis of what imports are necessary. If only one module imported by large_umbrella_module turns out to be necessary after some refactoring, that will not be automatically detected in a visible way.

Contributor Author

Another aspect is that compile times are already problematic, and these single monolithic imports hinder attempts at improving that. Some of this is that there are individual modules which take a long time to compile, and no amount of dependency management fixes that, but in other cases the times of lots of small-to-medium modules being imported literally add up. And even if/when Nim stops recompiling modules within a single "project" (nim c foo), which hopefully it will, that work gets repeated across nimbus_beacon_chain, nimbus_validator_client, et cetera.

One first step towards addressing this for any given module is clarifying what, exactly, it needs, and in various cases (e.g., the sqlite C code being compiled) that's meant splitting out testdbutils from testutils, explicitly avoiding this pile of import/export modules.

Contributor

Knowing the compiler internals, it's a bit hard to guess the effects of this technique on compilation times. You are right that each module in Nim is visited only once during compilation, so effectively a large project is likely to end up compiling the same set of files regardless of the import strategy.

The nim check command can benefit though because it may need to see a smaller subset of the project when you are close to the leaf modules in the import tree. There might also be some slight speedup brought by reducing the scope lookup tables and the numbers of candidates considered during overload resolution in some situations.

For me, the main downsides of this technique are that:

  1. It requires quite a lot of work (unless your editor does it for you automatically)
  2. It makes it even harder to spot inappropriate import cycles.

Both can be addressed with better tooling.

Member

@arnetheduck arnetheduck Jul 6, 2022

> vastly more symbols than they need imported

this is true, but it is also often the case that a missing symbol is a lot more dangerous than an extra one - the former means a more generic overload will be chosen leading to subtle miserialization and similar issues due to nim-ser's open matching policy, the latter a (presumably fixable) compile-time error.

> single monolithic imports hinder attempts at improving that.

we have two categories of monoliths - giants like chronos that import and export everything under their project, and "small" modules that have hard dependencies of their own - the former should of course be split up, but with the latter, the following happens very often: module A imports B, and C imports A - to use A from C, B must also be imported, so C now imports A and B - then A changes and no longer needs B, but B remains in C as an explicit import, and nim's (broken) import analysis rarely identifies it as unused.

For the latter case, I've found that it's easier to export hard dependencies - modules that use A then have a shorter import list of their own, which helps keep the list tight overall - there is no compile-time downside in this example because A has a hard dep on B, so B must be compiled no matter what..
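The export-hard-dependencies pattern described here reads roughly as follows (module and type names are illustrative only, not project code):

```nim
# a.nim -- sketch of the export-hard-dependency pattern (illustrative)
import b             # a's public signatures mention b's types,
export b             # so re-export b: anyone importing a can use
                     # those types without importing b themselves.
proc wrap*(x: b.Thing): b.Thing = x

# c.nim
import a             # a alone suffices; if a later drops its b
                     # dependency, c has no stale "import b" to forget.
```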

We have two more issues:

  • we reuse types from implementation libraries - for example, we reuse eth1 block hashes in eth2, which brings in a lot of junk from eth1 instead of using just a minimal eth2-specific definition (ditto libp2p) - presumably this is done because copying is bad, but it also creates rigid ties between the projects that are not always correct (just because an eth1 block hash is serialized to hex in one way in eth1 doesn't mean eth2 does it the same way - these kinds of false carry-overs are quite problematic)
  • the p2p macro defines both client and server at the same time - this requires clients (like the VC) to import practically all of the server as well (VC doesn't need libp2p, but still imports it) - similar problems happen with other "categories" of imports, and the result is that "auxiliary" compiles like tests and simulations take longer even though they don't use a lot of the stuff that transitively ends up being processed - bearssl was a good example as well

Contributor Author

> the former means a more generic overload will be chosen leading to subtle miserialization and similar issues due to nim-ser's open matching policy

That seems like an issue with nim-ser's open matching policy.

Member

> That seems like an issue with nim-ser's open matching policy.

yes, that's the biggest and most obvious culprit, and @zah has promised progress on this front, but it happens with other things generic as well (hash function which messes up Table, $, etc)

let newBlockRef = node.blockProcessor[].storeBlock(
MsgSource.api, wallTime, blck)
let newBlockRef = await node.blockProcessor.storeBlock(
MsgSource.api, wallTime, blck, payloadValid = true)
Member

how come payload is considered valid when block is coming from REST? ie rest is more or less gossip, from a trust-point-of-view

5 participants