
optimistic sync #3793

Merged: 12 commits merged into unstable from R9I, Jul 4, 2022
Conversation

tersec
Contributor

@tersec tersec commented Jun 23, 2022

@tersec tersec marked this pull request as draft June 23, 2022 12:38
@tersec tersec force-pushed the R9I branch 2 times, most recently from cda098b to d0972b6 Compare June 23, 2022 19:30
@tersec tersec marked this pull request as ready for review June 23, 2022 19:57
@tersec tersec mentioned this pull request Jun 23, 2022
@github-actions

github-actions bot commented Jun 24, 2022

Unit Test Results

     12 files ±0   857 suites ±0   1h 0m 48s ⏱️ −2m 22s
1 712 tests ±0: 1 660 passed ±0, 52 skipped ±0, 0 failed ±0
9 944 runs  ±0: 9 816 passed ±0, 128 skipped ±0, 0 failed ±0

Results for commit cf1f410. Comparison against base commit 9eb1a3e.

♻️ This comment has been updated with latest results.

@tersec tersec changed the title from [WIP] optimistic sync to optimistic sync Jun 25, 2022
@tersec tersec force-pushed the R9I branch 5 times, most recently from 57d0ad2 to 68a31e3 Compare June 26, 2022 00:06
Contributor

@etan-status etan-status left a comment

Still a few open TODO and when true that need to be fixed or logged as issues to avoid losing them.

Overall quite impressive how compact this change is implementation-wise.

beacon_chain/consensus_object_pools/block_clearance.nim (outdated; resolved)
beacon_chain/consensus_object_pools/block_clearance.nim (outdated; resolved)
beacon_chain/consensus_object_pools/blockchain_dag.nim (outdated; resolved)
beacon_chain/consensus_object_pools/blockchain_dag.nim (outdated; resolved)
Comment on lines 1417 to 1418
warn "markPayloadInvalid: attempt to invalidate valid block",
blck = shortLog(blck.get)
Contributor

During initialization, only finalizedHead and head are populated into dag.optimisticRoots. Could it happen that this function is called on one of the intermediate blocks? Is this a condition that is expected in edge cases, or is this a programming error? If the latter, a doAssert verifyFinalization notin dag.updateFlags could be useful to detect this being hit.

Contributor Author

@tersec tersec Jun 28, 2022

https://github.com/ethereum/consensus-specs/blob/dev/sync/optimistic.md#transitioning-from-valid---invalidated-or-invalidated---valid describes, as far as this warning's topic:

These operations are purposefully omitted. It is outside of the scope of the specification since it's only possible with a faulty EE.

Such a scenario requires manual intervention.

It's possible that it'd be better to be even more emphatic than a warning, because this is a failure on par with doppelganger detection. Something went very wrong. That said, while it's not necessarily a Nimbus programming error, it is a programming error creating potential consensus failure, and verifyFinalization mode making this more obvious would be useful.

Contributor Author

3c864f0 adds doAssert verifyFinalization notin dag.updateFlags

beacon_chain/validators/validator_duties.nim (outdated; resolved)
beacon_chain/validators/validator_duties.nim (outdated; resolved)
tests/test_block_processor.nim (outdated; resolved)
tests/test_block_processor.nim (outdated; resolved)
tests/test_block_processor.nim (outdated; resolved)
@tersec tersec force-pushed the R9I branch 3 times, most recently from 8100623 to e56c9d1 Compare June 30, 2022 08:54
@tersec
Contributor Author

tersec commented Jun 30, 2022

> Still a few open TODO and when true that need to be fixed or logged as issues to avoid losing them.
>
> Overall quite impressive how compact this change is implementation-wise.

There's only one when true, and that it shows up in this PR is an artifact of refactoring the block processing loop. It was already there, and this PR isn't going to fix it. It's related to TTD detection functionality.

This PR now removes significantly more TODOs than it adds:

$ git diff -U0 f08f9e2bd7e1ba9fe55b02272189134871769031 | grep TODO
+  # TODO rename this to get_optimistic_head
+    # TODO have a third version which is fire-and-forget for when it is merge
+         # TODO self.consensusManager.eth1Monitor.terminalBlockHash.isSome
+         # TODO detect have-TTD-but-not-is_execution_block case, and where
-          # TODO self.consensusManager.eth1Monitor.terminalBlockHash.isSome
-    # TODO (cheatfate): Proper implementation required
-    # TODO (cheatfate): Proper implementation required
-    # TODO (cheatfate): Proper implementation required
-      # TODO (cheatfate): Proper implementation required
-          # TODO (cheatfate): Proper implementation required
-      # TODO (cheatfate): Proper implementation required.
-      # TODO (cheatfate): Proper implementation required.
-    # TODO (cheatfate): Proper implementation required.
-    # TODO (cheatfate): Proper implementation required
-    # TODO (cheatfate): Proper implementation required

The rest are pre-existing and/or out of scope/deliberately not part of this PR.

@tersec
Contributor Author

tersec commented Jul 1, 2022

Tested successfully with syncs from scratch on Ropsten with each of Erigon, Besu, Geth, and Nethermind.

@@ -80,24 +80,124 @@ proc expectBlock*(self: var ConsensusManager, expectedSlot: Slot): Future[bool]

return fut

from eth/async_utils import awaitWithTimeout
Contributor

I recently had a discussion with a D programmer who promoted this style of function-local imports. I guess your rationale is the same - to make it easy to copy/paste functions to a different module by keeping their list of imports attached to the function itself.

One potential risk of this approach is that it also makes it much easier to accidentally introduce an import cycle while moving the function which will result in the typical hard-to-decipher error messages brought by cyclic imports.

Contributor Author

Sure, it helps for that purpose.

Other advantages:

  • Nim 1.2's unused-import detection is essentially broken, and Nim 1.6's is only somewhat better. As a result, import lists simply accumulate: there's a 1,000-line module with a pile of imports at the top and no hint of what they provide or why. Enough modules have import-time side effects that the unused-import warning gets disabled, too. Local imports make it clearer that, for modules imported for their symbols rather than their side effects (such as in tests), if no imported symbol is used, the import isn't necessary. It's not 100% automated, but it's better than just letting imports accumulate, and in theory it gives Nim more information for detecting unused imports. The more local and specific the imports, both in the symbols they name and where they're used, the better for this purpose.

It's related to one of the reasons I tend to unexport symbols when I notice them: it's not because of some OOP-thing about data hiding, it's about allowing Nim to detect dead code and, when reading or refactoring code, being confident that (unless there's an export statement with said symbol) changes in that code will only directly affect that one module, and that one doesn't have to scan the entire project. The deeper in the libraries one is working, the more this applies.

  • Various surprising overloads can bubble up from imported modules and cause surprising errors of their own. For example,

withState(someState):
  template f(): untyped = state.foo
  f()

worked right until some irrelevant-to-purpose symbol from deep inside Chronos's HTTP server was imported transitively, which produced a confusing error message and required writing template f(): untyped = state().foo instead. Easy fix, but it took a while to find the exact culprit.
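As a minimal sketch of the function-local import style under discussion (module and proc names here are illustrative, not taken from this PR): Nim imports are top-level, but placing one directly above the single proc that uses it keeps the dependency attached to the code that needs it.

```nim
# Sketch only; illustrative names, not project code.
# The import sits immediately before the one proc that uses it,
# so moving the proc elsewhere carries its dependency along visibly.
from std/strutils import toHex

proc describeAddress(x: uint64): string =
  # toHex is the only symbol this module takes from strutils
  "0x" & toHex(x)
```

If the proc later moves to another module, the from-import line moves with it, and no stale entry is left in a module-header import list.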

Contributor Author

If this interferes with Nim's cyclic import detection, IMO that's a Nim issue.

Contributor

Cyclic imports are legal in Nim, they just produce hard to diagnose error messages at times.
While I have no strong objections against this style, my comment was meant to serve as a warning that one has to be careful with the notion that individual functions are now easy to move between modules if their imports are structured in this way.

Member

> Cyclic imports are legal in Nim, they just produce hard to diagnose error messages at times.

they also produce invalid code with overly broad arguments, where the exact function that actually matches ends up depending on frivolous things like import order - there was no end to ssz import issues until all cyclical imports were removed - this is one more nim feature that sounds good on paper but is worthless practically because it's broken and untested.

Contributor

I was not advocating that they should be used. I was explaining that there is no "cyclic import detection" as the message above implied (it would be relatively easy to add to Nim though).

Otherwise, I have advocated many times that cyclic imports should be supported better in the language because their current state leads to the proliferation of the "types module" which makes encapsulation difficult to achieve in Nim:

https://forum.nim-lang.org/t/4293
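The cycle being discussed, and the "types module" workaround, can be sketched like this (file and type names are illustrative only):

```nim
# a.nim -- illustrative sketch, not project code
import b                  # a depends on b ...
type Block* = object
  id*: int

# b.nim
import a                  # ... and b depends back on a: legal in Nim,
                          # but diagnostics and overload resolution
                          # around the cycle can depend on import order.
proc describe*(blk: a.Block): string = $blk.id

# The common workaround: move Block into a third module, e.g.
# types.nim, which both a and b import -- the "types module" pattern
# that makes encapsulation harder, as noted above.
```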

@zah zah merged commit 1221bb6 into unstable Jul 4, 2022
@zah zah deleted the R9I branch July 4, 2022 20:35
@tersec tersec mentioned this pull request Jul 5, 2022
signedBlock: ForkySignedBeaconBlock, payloadValid: bool,
queueTick: Moment = Moment.now(),
validationDur = Duration()):
Future[Result[BlockRef, BlockError]] {.async.} =
Member

this design in which storeBlock becomes async is a bit unfortunate - ie the processes that call it are not necessarily well prepared to lose their current in-memory state for long periods of time, especially not while waiting for a (potentially slow) EL response (since they're part of validator duties)

Contributor Author

@tersec tersec Jul 5, 2022

It could do something like traceAsyncErrors, or set up yet another queue, but there's fundamentally an asynchronous aspect to this, and it can't avoid changing the storeBlock contract in some way -- either it can return without any particular guarantee of "full" success (even detecting blatantly invalid blocks), or callers need to wait. One minimal change is to always have await self.consensusManager.updateHeadWithExecution(wallTime.slotOrZero) there be either a queue or (implicitly a queue) traceAsyncErrors, which would detach the two and allow storeBlock itself to be synchronous again.

Another point I played with was within the updateHead functions -- the serial dependence here is forkchoiceUpdated (intrinsically a long-lasting call) having to complete before one can, in general, tell the DAG about blocks being optimistic or verified. It's possible to just fire off the fcU, then immediately call dag.updateHead, but that's generally not great. One could arrange a callback or similar, but that's the kind of pattern async/await more-or-less linearizes control-flow-wise. So probably within updateHeadWithExecution, the async is warranted.

Contributor Author

@tersec tersec Jul 5, 2022

A general theme here is how much of the serial dependency is real vs how much is an artifact of a specific way of writing these functions (with async, with callbacks, with queues, with anything), but, in principle, to do the tasks they need to do to participate in the chain, how intrinsic vs how incidental are these delays.

One approach is to care even less about validity than it does now, and just ... not wait. Treat it as fire-and-semi-forget. Just import it if it's a candidate for optimistic sync, move on, etc. There might be some weird display quirks (e.g., the status bar showing wall slot != wall head briefly each slot even when everything goes well, or repeatedly showing opt sync briefly if it tries to be truly accurate for fractions of a second), but more important is that the intrinsic dependencies exist: the aggregated attestations depend on the unaggregated attestations, which depend on the correct head being known.

The challenge here is that https://github.com/ethereum/consensus-specs/blob/dev/sync/optimistic.md#validator-assignments notes:

Validator assignments

An optimistic node is not a full node. It is unable to produce blocks, since an execution engine cannot produce a payload upon an unknown parent. It cannot faithfully attest to the head block of the chain, since it has not fully verified that block.

Block Production

An optimistic validator MUST NOT produce a block (i.e., sign across the DOMAIN_BEACON_PROPOSER domain).

Attesting

An optimistic validator MUST NOT participate in attestation (i.e., sign across the DOMAIN_BEACON_ATTESTER, DOMAIN_SELECTION_PROOF or DOMAIN_AGGREGATE_AND_PROOF domains).

Participating in Sync Committees

An optimistic validator MUST NOT participate in sync committees (i.e., sign across the DOMAIN_SYNC_COMMITTEE, DOMAIN_SYNC_COMMITTEE_SELECTION_PROOF or DOMAIN_CONTRIBUTION_AND_PROOF domains).

To comply with the spec, this serialization enforced by await is required: if one wants to attest, and attest as soon as possible modulo the make-sure-everyone-gets-the-block-first delay, the block must be actually verified, not only optimistic.

So for a non-validating node, the literal stakes are lower, and for a validating node, there's ultimately no choice but to wait. At least this way, if the node has other things to do, it can do them in the meantime.
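The wait the spec forces could be sketched as follows. This is a sketch of the constraint only, not the PR's actual code: elClient, newPayload, PayloadStatus, and performDuties are hypothetical names.

```nim
# Sketch only: elClient, newPayload, PayloadStatus and performDuties
# are hypothetical names used for illustration.
proc storeBlockThenAttest(node: BeaconNode,
                          blck: ForkySignedBeaconBlock) {.async.} =
  # An optimistic validator MUST NOT sign across the proposer,
  # attester, or sync-committee domains, so duties have to wait for
  # the execution layer's verdict, not merely for optimistic import.
  let status = await node.elClient.newPayload(blck)  # potentially slow
  if status == PayloadStatus.valid:
    # Only now is attesting spec-compliant.
    node.performDuties(blck.slot)
```

The await is the point: for a validating node the serialization is unavoidable, while a non-validating node could in principle import optimistically and move on.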

Contributor Author

#3842 might address this. It needs testing, but should be a viable approach. It keeps enough of these invariants that it should remain valid, but be less invasive.

var
optForkchoiceHeadSlot = GENESIS_SLOT # safe default
optForkchoiceHeadRoot: Eth2Digest
from ../consensus_object_pools/blockchain_dag import
Member

funny because in refactoring, I've been going in the opposite direction of exporting stuff much more consistently to avoid long import lists - this helps keep symbols that should be visible, visible - the template problem with the wrong symbol being matched is annoying but can often be solved by turning the template into a proc or generally giving things more descriptive names.. it's a mess, but with exports, it's actually easier to stay away from redundant imports (because there's simply fewer of them that can go stale)

Contributor Author

It also creates environments with vastly more symbols imported than they need, and no automatic way for Nim to flag this, because those imports are indeed necessary; it's just that only a small part of each is used. When isolating reproducible test cases for Nim issue filing I run across this repeatedly -- some module needs three symbols from some relatively small modules, but for some slight convenience, it does

import large_umbrella_module

which in turn is something like

import a, b, c, d, e, f
export a, b, c, d, e, f

and then some of a, b, etc continue this fractally. It works, yes, but it also inhibits analysis of what imports are necessary. If only one module imported by large_umbrella_module turns out to be necessary after some refactoring, that will not be automatically detected in a visible way.

Contributor Author

Another aspect is that compile times are already problematic, and these single monolithic imports hinder attempts at improving that. Some of this is that there are individual modules which take a long time to compile, and no amount of dependency management fixes that, but in other cases the times of lots of small-to-medium modules being imported literally add up. And even if/when Nim stops recompiling modules within a single "project" (nim c foo), which hopefully it will, that work gets repeated across nimbus_beacon_chain, nimbus_validator_client, et cetera.

One first step towards addressing this for any given module is clarifying what, exactly, it needs, and in various cases (e.g., the sqlite C code being compiled) that's meant splitting out testdbutils from testutils, explicitly avoiding this pile of import/export modules.

Contributor

Knowing the compiler internals, it's a bit hard to guess the effects of this technique on compilation times. You are right that each module in Nim is visited only once during compilation, so effectively a large project is likely to end up compiling the same set of files regardless of the import strategy.

The nim check command can benefit though because it may need to see a smaller subset of the project when you are close to the leaf modules in the import tree. There might also be some slight speedup brought by reducing the scope lookup tables and the numbers of candidates considered during overload resolution in some situations.

For me, the main downsides of this technique are that:

  1. It requires quite a lot of work (unless your editor does it for you automatically)
  2. It makes it even harder to spot inappropriate import cycles.

Both can be addressed with better tooling.

Member

@arnetheduck arnetheduck Jul 6, 2022

> vastly more symbols than they need imported

this is true, but it is also often the case that a missing symbol is a lot more dangerous than an extra one - the former means a more generic overload will be chosen leading to subtle miserialization and similar issues due to nim-ser's open matching policy, the latter a (presumably fixable) compile-time error.

> single monolithic imports hinder attempts at improving that.

we have two categories of monoliths - giants like chronos that import and export everything under their project, and "small" modules that have hard dependencies of their own - the former should of course be split up, but with the latter, the following happens very often: module A imports B, and C imports A - to use A from C, B must also be imported, so C now imports A and B - then A changes and no longer needs B, but B remains in C as an explicit import, and nim's (broken) import analysis rarely identifies it as unused.

For the latter case, I've found that it's easier to export hard dependencies - modules that use A then have a shorter import list of their own, which helps keep the list tight overall - there is no compile-time downside in this example because A has a hard dep on B, so B must be compiled no matter what..
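The export-hard-dependencies pattern described here reads roughly as follows (module and type names are illustrative only, not project code):

```nim
# a.nim -- sketch of the export-hard-dependency pattern (illustrative)
import b             # a's public signatures mention b's types,
export b             # so re-export b: anyone importing a can use
                     # those types without importing b themselves.
proc wrap*(x: b.Thing): b.Thing = x

# c.nim
import a             # a alone suffices; if a later drops its b
                     # dependency, c has no stale "import b" to forget.
```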

We have two more issues:

  • we reuse types from implementation libraries - for example, we reuse eth1 block hashes in eth2, which brings in a lot of junk from eth1 instead of using just a minimal eth2-specific definition (ditto libp2p) - presumably this is done because copying is bad, but it also creates rigid ties between the projects that are not always correct (just because an eth1 block hash is serialized to hex in one way in eth1 doesn't mean eth2 does it the same way - these kinds of false carry-overs are quite problematic)
  • the p2p macro defines both client and server at the same time - this requires clients (like the VC) to import practically all of the server as well (VC doesn't need libp2p, but still imports it) - similar problems happen with other "categories" of imports, and the result is that "auxiliary" compiles like tests and simulations take longer even though they don't use a lot of the stuff that transitively ends up being processed - bearssl was a good example as well

Contributor Author

> the former means a more generic overload will be chosen leading to subtle miserialization and similar issues due to nim-ser's open matching policy

That seems like an issue with nim-ser's open matching policy.

Member

> That seems like an issue with nim-ser's open matching policy.

yes, that's the biggest and most obvious culprit, and @zah has promised progress on this front, but it happens with other things generic as well (hash function which messes up Table, $, etc)

let newBlockRef = node.blockProcessor[].storeBlock(
MsgSource.api, wallTime, blck)
let newBlockRef = await node.blockProcessor.storeBlock(
MsgSource.api, wallTime, blck, payloadValid = true)
Member

how come payload is considered valid when block is coming from REST? ie rest is more or less gossip, from a trust-point-of-view

5 participants