cmd,internal/era: implement export-history subcommand #26621
Is this PR superseding #25325?
internal/e2store/e2store.go
length += uint64(b[6]) << 32
length += uint64(b[7]) << 40

val := make([]byte, length)
This is a bit unsafe. length can be 255 PB, I think this should be capped to maybe 2GB or something.
Tangent: why so large length even?
Yeah I realized this, was trying to understand how rlp handles this since the format technically supports strings of size 2^((0xc0-0xb7)*8). Seems like it tries to interrogate the underlying type to see if it can figure out how much data is available to be read.
Is there a better way to deal with this? Suppose length for whatever reason is extremely large (larger than we can put in memory)? I kind of think we should not support that case, but it would be better to fail gracefully than panic here.
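One way the "fail gracefully" idea could look is sketched below. It is only an illustration: the 8-byte header layout (a 2-byte type followed by a 6-byte little-endian length) is inferred from the shifts in the snippet under review, and `maxEntrySize`, `decodeHeader`, and `readEntry` are hypothetical names rather than anything in this PR. The cap uses the 2 GB figure floated above.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// maxEntrySize is a hypothetical cap (2 GiB), following the suggestion in the review.
const maxEntrySize = 2 << 30

var errEntryTooLarge = errors.New("e2store: entry length exceeds cap")

// decodeHeader splits an assumed 8-byte header into a 2-byte type and a
// 6-byte little-endian length, rejecting lengths above maxEntrySize.
func decodeHeader(hdr [8]byte) (typ [2]byte, length uint64, err error) {
	copy(typ[:], hdr[0:2])
	// Widen the 6 length bytes to 8 so they can be decoded as a uint64.
	var buf [8]byte
	copy(buf[:6], hdr[2:8])
	length = binary.LittleEndian.Uint64(buf[:])
	if length > maxEntrySize {
		return typ, 0, fmt.Errorf("%w: %d bytes", errEntryTooLarge, length)
	}
	return typ, length, nil
}

// readEntry reads one record from r. The value buffer is only allocated
// after the length has passed the cap check, so a corrupt header yields
// an error instead of a huge allocation.
func readEntry(r io.Reader) (typ [2]byte, val []byte, err error) {
	var hdr [8]byte
	if _, err = io.ReadFull(r, hdr[:]); err != nil {
		return typ, nil, err
	}
	typ, length, err := decodeHeader(hdr)
	if err != nil {
		return typ, nil, err
	}
	val = make([]byte, length)
	if _, err = io.ReadFull(r, val); err != nil {
		return typ, nil, err
	}
	return typ, val, nil
}

func main() {
	raw := []byte{0x03, 0x00, 0x02, 0, 0, 0, 0, 0, 0xde, 0xad}
	typ, val, err := readEntry(bytes.NewReader(raw))
	fmt.Println(typ, val, err)
}
```

Checking the length before `make` keeps the failure mode an ordinary error that the caller can report, which is the behaviour asked for here.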
Yes this supersedes #25325!
careful with the byte order here - types are arrays of 2 bytes (not integers) to avoid endian-ness issues - in the e2store spec, the version is
I wonder if we should store the full tree here or add the list of hashes as a separate entry - in the beacon chain, the root of each block is available from the state allowing the verifier to check blocks one by one - when a file is trusted, this also acts as a quick way to construct a hash->number index, without having to actually hash the data.
You mean store the full list of header records (I updated the defn btw, I realize it should have been a
If you want to run this yourself and test it, try the following:
$ geth export-history eras 0 15537393 8192
This will export all the pre-merge blocks. For me, it takes around 6 hrs. You can verify them by running:
$ wget https://gist.githubusercontent.com/lightclient/528b95ffe434ac7dcbca57bff6dd5bd1/raw/fd660cfedb65cd8f133b510c442287dc8a71660f/roots.txt
$ era verify roots.txt
This takes around 3.5 hours.
excellent - either list or full tree works well, full list is certainly enough!
@arnetheduck I implemented the hashes section in a separate branch. Not sure if it makes sense to include here right now, can always add later (backwards compatible) if it's something that we really need.
I think this PR is ready to be reviewed again. @holiman, I took a look at changing this PR to only use the freezer, but ultimately I need the genesis block too (which IIUC is not available from freezer), so would need to already open the datadir to get that. I am also getting the network id to determine the naming scheme (would be easy to replace this with a cli parameter though). Here is the branch associated with the attempt. Let me know if you still feel strongly about switching to freezer-only format and maybe we can work out a different way!
CI is red
I'm in the process of implementing era1 for Nimbus and while doing so it came to my attention that Snappy compression is non-deterministic. Hence, I don't think we can use the checksums verification cross-client/implementation.
So I think that part should be removed from the specification when it gets more formalized.
Some smaller suggestions to the specification in the PR description:
- It should be mentioned that TotalDifficulty's uint256 must be LE encoded (as it is not rlp or ssz encoded).
- Change the name of Accumulator to AccumulatorRoot to better represent the actual data.
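To make the little-endian point concrete, here is a minimal sketch of encoding a total difficulty as a fixed 32-byte little-endian uint256. The function names and the decimal-string input are illustrative assumptions, not the PR's API.

```go
package main

import (
	"errors"
	"fmt"
	"math/big"
)

// reverse flips the byte order of a 32-byte value in place.
func reverse(b *[32]byte) {
	for i, j := 0, len(b)-1; i < j; i, j = i+1, j-1 {
		b[i], b[j] = b[j], b[i]
	}
}

// encodeTotalDifficulty parses a decimal string and returns it as a
// fixed-width, little-endian uint256.
func encodeTotalDifficulty(dec string) ([32]byte, error) {
	var out [32]byte
	td, ok := new(big.Int).SetString(dec, 10)
	if !ok {
		return out, errors.New("not a decimal integer")
	}
	if td.Sign() < 0 || td.BitLen() > 256 {
		return out, errors.New("value does not fit in uint256")
	}
	td.FillBytes(out[:]) // big-endian, zero-padded to 32 bytes
	reverse(&out)        // convert to little-endian
	return out, nil
}

// decodeTotalDifficulty is the inverse: little-endian bytes to decimal string.
func decodeTotalDifficulty(b [32]byte) string {
	reverse(&b) // b is a copy, so the caller's array is untouched
	return new(big.Int).SetBytes(b[:]).String()
}

func main() {
	enc, err := encodeTotalDifficulty("258") // 258 == 0x0102
	if err != nil {
		panic(err)
	}
	fmt.Printf("%x\n", enc[:2]) // prints "0201"
	fmt.Println(decodeTotalDifficulty(enc))
}
```

The fixed width matters because the value is neither rlp nor ssz encoded, so a reader has no length prefix to fall back on.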
And lastly I was wondering if it would make sense to add an Index also for the accumulator root? Similar to what is done in consensus era files for the full state.
Unless I misunderstand the specs, I think currently one would have to jump over all the records until one hits the Accumulator root. (Well, one could read the BlockIndex, jump immediately to the 8191st block tuple records and skip those, but that feels a bit like a hack).
Regarding use cases of getting the Accumulator root individually. Typically it would be read after also iterating over all the block-tuples to actually build the Accumulator for verification. So in that scenario it is not really needed.
But one use case could be when you want to verify the accumulator root with values you have out-of-band, before you get into reading/verifying all the data.
Agree that they cannot be used for cross-client verification. I moved around the note about them and added they are not part of the spec. I do intend to share them alongside our era1 files though for convenience of verification. I think other clients can do the same, or we can decide on a single implementation to share. This works for era1 because we only need to do it once, we aren't constantly generating them each time.
Updated to mention it is LE encoded.
Updated.
I think we should keep things simple. As you said, you can technically just traverse to before the block index if you really need the performance.
...and version. I don't quite see the use case of "simple verification" here, in the sense that anyone ingesting the files will have the means to decode them as well and the data is already covered by crc32 for "trivial" verification (because snappy framed). We provide a trivial CLI for verifying the integrity of files for the purpose of post-download verification (similar to how one would verify with sha256sum)
An independent implementation should be able to generate a valid era1 file to verify that the data was correctly generated by the other implementation - this implies that the spec should have a section on tests that the files should pass (similar to era for eth2)
Fair enough, I wasn't sure about this suggestion myself. I've now implemented it by just reversing the size of the record from the block index position.
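The offset arithmetic behind "reversing from the block index position" might look roughly like the sketch below. It rests on several assumptions that are not confirmed in this thread: an 8-byte entry header, a block index payload of starting-number, one offset per block, and count (all 8-byte words), the block index being the last record in the file, and a 32-byte accumulator root stored in the entry immediately before it. All names are hypothetical.

```go
package main

import "fmt"

const (
	headerSize = 8  // assumed e2store entry header size
	wordSize   = 8  // block index values are little-endian uint64
	rootSize   = 32 // assumed size of the accumulator root payload
)

// blockIndexStart returns the file offset of the block index record,
// assuming it is the final record and holds count offsets plus the
// starting-number and count words.
func blockIndexStart(fileSize, count int64) int64 {
	return fileSize - (headerSize + wordSize*(count+2))
}

// accumulatorStart steps back one more record from the block index,
// assuming the accumulator entry sits directly in front of it.
func accumulatorStart(fileSize, count int64) int64 {
	return blockIndexStart(fileSize, count) - (headerSize + rootSize)
}

func main() {
	// A toy file of 1000 bytes holding 2 blocks.
	fmt.Println(blockIndexStart(1000, 2))  // prints 960
	fmt.Println(accumulatorStart(1000, 2)) // prints 920
}
```

Under these assumptions a reader only needs the file size and the trailing count word to seek to the accumulator root, without walking the block tuples.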
It takes a very long time to verify the data integrity because you need to recompute the tx hash and receipt hash. On the order of hours. I don't have a recent benchmark. So the checksum is extremely fast verification that data is correct. This is especially useful if you're going to just use era1 as a drop in db backend for block look ups. No need to spend hours validating / ingesting.
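For comparison, the checksum path amounts to streaming the file through sha256 once. A minimal sketch, with hypothetical helper names and no claim about how `cmd/era` does it:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// checksum streams r through sha256 and returns the hex digest, so the
// file never has to be held in memory at once.
func checksum(r io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// checksumFile computes the digest of the file at path.
func checksumFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	return checksum(f)
}

// checksumBytes is a convenience wrapper for in-memory data.
func checksumBytes(b []byte) (string, error) {
	return checksum(bytes.NewReader(b))
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: checksum <file>")
		return
	}
	sum, err := checksumFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(sum)
}
```

This does no decoding, re-serialization, or root computation, which is where the speed difference described above comes from.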
Not sure exactly which tests you're referring to? I can add something saying that clients should be able to i) compute the same accumulator root for the era1 and ii) verify each total difficulty is correct.
why does this take significantly longer than computing a file hash? ie it's a similar order of magnitude of data to be hashed.
We ended up doing this for era but without any such intermediate verification - basically even just reading a large archive like this takes time and because the consumer must do a verification anyway, typically (say if you're serving the data over the wire) it doesn't greatly matter for the provider. If on the other hand you want to use the data (ingest it), it's not a bad idea to check that it is well-formed and leads to the actual block you expect. The in-between seems like it'll still take a not insignificant amount of time, is non-deterministic and says more or less that the file transfer worked with a bit higher confidence than the already-present crc32.
FWIW, I'm curious about this case for the simple reason that we could add a header in the file that contains a hash as a new "type", but ... What we do for verification in consensus land is read the header of the beacon state which contains the block tree hashes then verify those - if I were to guess, this takes 2-3x longer than simply hashing the data, which in the grand scheme of things isn't much of a difference.
The other trick, taken from the crc32, would be to focus on the uncompressed data and hash that - this would be compatible across all implementations (obviously).
There is a lot of copying, reflection, and serializing to compute the roots. We have to read the block body into our block body structure, then get the transactions out of it, re-serialize, compute root, etc. I would be curious how fast other clients can do this. Of course it can be done faster, but at a rather large cost to code complexity.
I don't really understand what you mean in your second comment by "intermediate verification" and "in-between". Do you think there needs to be an additional type in the era1 format?
Let's goo
I mean that verifying a sha256sum of a file generated in a non-deterministic way (due to snappy) provides little value in the end - it still takes time, doesn't bring cross-client guarantees or "data idempotency" meaning we can't post those hashes as "universal truth" without trusting a single-client implementation and for the end user, it brings limited value in the sense that transmission errors will get detected anyway and for "real" verification of the file, a more elaborate process is needed regardless ("this data corresponds to the consensus of the merge block").
This is probably not a hill to die on, but tying the hashes to a particular go-snappy version and its particular usage in geth feels off. We could solve these problems with a new type in the era1 format that hashes the uncompressed data which would address the performance question probably but that feels like an optimization and it's questionable if the complexity is worth it.
I'm not proposing we make the checksums a universal source of truth. That is what the accumulator root is for. It's just a convenient option for go-ethereum users and doesn't impact the standard. Other clients do not need to do this.
We'll merge this code in, it doesn't mean that the format / implementation is final. Merging it in makes it possible to test and also further experiment with integration with the freezer.
This PR allows users to export their chain into an archive format called Era1. It is formulated similarly to the Era format[1], which is optimized for reading and distributing CL data. The Era and Era1 formats are stricter subsets of a simple type-length-value scheme called e2store[2], both developed by the Nimbus team.
The Era format was originally designed to distribute Beacon chain data, but it is clearly desirable to have the same functionality for the EL. A shareable and verifiable archive format is generally considered the first step towards implementing history pruning[3].
For these reasons, a special flavor of the Era format was developed[4]: Era1. Its goal is to service EL history pre-merge. The format only concerns itself with data before the merge because post-merge all EL data is encapsulated within the Beacon chain, and therefore the existing Era scheme is sufficient.
Specification
The format can be summarized with the following expression:
Each basic element is its own e2store entry:
TotalDifficulty is little-endian encoded.
HeaderRecord is defined in the Portal Network specification[5].
BlockIndex stores relative offsets to each compressed block entry. The format is:
All values in the block index are little-endian uint64.
starting-number is the first block number in the archive. Every index is defined relative to the index's location in the file. The total number of block entries in the file is recorded in count.
Due to the accumulator size limit of 8192, the maximum number of blocks in an Era batch is also 8192. This is also the value of SLOTS_PER_HISTORICAL_ROOT[6] on the Beacon chain, so it is nice to align on the value.
Verification
There are two verification paths.
Accumulator Roots
The accumulator roots are hash tree roots of HeaderRecords[5]. This is a relatively format-agnostic verification method: if Era1 changes in the future, the accumulator values will not, because the accumulator is shared with the Portal Network. However, verifying the accumulators is expensive because the underlying transactions and receipts need to be verified against their headers before the overall accumulator is checked. Once this is complete, the HeaderRecords can be generated and htr'd.
The era binary is provided by cmd/era. For a full list of accumulator values, see accumulators.md.
$ era verify accumulators.txt
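Returning to the BlockIndex described under Specification, a reader for its payload could be sketched as follows. The word order (starting-number first, then one offset per block, then count) and the treatment of offsets as signed values are assumptions made for illustration; the type and function names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// blockIndex is a hypothetical in-memory view of a BlockIndex payload.
type blockIndex struct {
	start   uint64  // first block number in the archive
	offsets []int64 // relative offsets, one per block entry
}

// parseBlockIndex decodes a payload assumed to be laid out as
// starting-number | offset * count | count, all little-endian 8-byte words.
func parseBlockIndex(data []byte) (*blockIndex, error) {
	if len(data) < 16 || len(data)%8 != 0 {
		return nil, errors.New("block index: malformed length")
	}
	start := binary.LittleEndian.Uint64(data[:8])
	count := binary.LittleEndian.Uint64(data[len(data)-8:])
	// The recorded count must agree with the number of offset words present.
	if count != uint64(len(data)/8-2) {
		return nil, errors.New("block index: count does not match payload size")
	}
	offsets := make([]int64, count)
	for i := range offsets {
		offsets[i] = int64(binary.LittleEndian.Uint64(data[8+8*i:]))
	}
	return &blockIndex{start: start, offsets: offsets}, nil
}

// offsetOf returns the relative offset recorded for the given block number.
func (idx *blockIndex) offsetOf(number uint64) (int64, error) {
	if number < idx.start || number-idx.start >= uint64(len(idx.offsets)) {
		return 0, errors.New("block index: block number out of range")
	}
	return idx.offsets[number-idx.start], nil
}

func main() {
	data := []byte{
		5, 0, 0, 0, 0, 0, 0, 0, // starting-number = 5
		0xf0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, // offset = -16
		1, 0, 0, 0, 0, 0, 0, 0, // count = 1
	}
	idx, err := parseBlockIndex(data)
	if err != nil {
		panic(err)
	}
	off, err := idx.offsetOf(5)
	fmt.Println(off, err)
}
```

Cross-checking count against the payload length is a cheap way to reject a truncated or corrupt index before any seek is attempted.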
Checksums
Note: checksums are platform dependent due to the lack of byte-for-byte conformance of snappy. They are for convenience if clients wish to distribute them.
In order to provide fast verification of integrity, we also provide checksum values of each Era1 file. The checksum is simply the sha256 digest of the Era1 file. This requires trusting a third party's checksum values.
See checksums.txt for a full list of checksum values.
Footnotes
[1] https://github.com/status-im/nimbus-eth2/blob/stable/docs/e2store.md#era-files
[2] https://github.com/status-im/nimbus-eth2/blob/stable/docs/e2store.md
[3] https://eips.ethereum.org/EIPS/eip-4444
[4] https://hackmd.io/@arnetheduck/H15vMzx2s
[5] https://github.com/ethereum/portal-network-specs/blob/master/history-network.md#the-header-accumulator
[6] https://github.com/ethereum/consensus-specs/blob/dev/specs/phase0/beacon-chain.md#time-parameters