Garbage collector #2111

Closed · wants to merge 8 commits

Conversation

@karalabe (Member) commented Jan 8, 2016

Geth garbage collection

Currently the database storing the Go Ethereum blockchain is around 5.5 GB for a normal, only sparingly running node (i.e. my develop instance). Besides all the live information that the node actually requires to function, there are various leftovers from past transaction processing and micro forks:

  • Contract state entries that are no longer relevant (contract state changed since)
  • Global state trie nodes that are no longer referenced by the current blockchain
  • Headers, blocks and receipts that were in a small fork, but overruled later

All of the above leftovers significantly increase the size of the database, but none of them are of actual value (unless you want to inspect past state). If we do a fast sync, gathering only the most essential data required to run the node, the database weighs around 1 GB, meaning that for an average node we have a junk multiplier of about 5x; for miners, who regularly generate uncles and mini forks, this is probably much higher.

As the biggest bottleneck we currently have is disk IO, it's essential that we keep the database size down to the minimum possible. The most obvious step to achieve this at the moment is to introduce garbage collection into Geth that will constantly discard useless data.

This entails deleting forked data (i.e. chain elements that were overruled) and out-of-date state data (i.e. state trie nodes not part of the current top N blocks). In the next sections I'll describe a set of garbage collection algorithms (along with the database modifications required to enable them) that should be able to keep the accumulated junk down, while at the same time being self-correcting (junk should eventually disappear, even after a crashed node or an aborted GC run).

Note: the algorithms are designed with huge datasets in mind. If the chain/state database can fit into memory, even relatively naive algorithms can perform reasonably well, but our aim is to make these work as seamlessly as possible in terms of CPU, memory and disk IO.

State trie pruning

Discarded solution: Reference counting

The first investigated solution was the one proposed by Vitalik in his State Trie Pruning blog post.

The exact internal data representation was not clearly defined, but the solution entailed keeping a counter for each node in the database with the number of entities referencing it, marking the trie node as "deletable" if the reference reaches zero. After some number of blocks pass, nodes marked deletable and still not referenced are dropped.
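To make the idea concrete, here is a minimal sketch of how such counter bookkeeping could look; the type names, the deleter interface and the retention policy are illustrative assumptions, not the design from the blog post:

```go
// Minimal reference-counting sketch; all names and the retention policy are
// illustrative assumptions, not the design from the blog post.
package refcount

type hash [32]byte

// deleter is the only database capability the sketch needs.
type deleter interface {
	Delete(key []byte) error
}

type pruner struct {
	refs      map[hash]int    // live reference count per trie node
	deathRow  map[hash]uint64 // block number at which a node became unreferenced
	retention uint64          // blocks to wait before actually deleting
}

func newPruner(retention uint64) *pruner {
	return &pruner{
		refs:      make(map[hash]int),
		deathRow:  make(map[hash]uint64),
		retention: retention,
	}
}

// reference notes a new link to a node, resurrecting it if it was on death row.
func (p *pruner) reference(node hash) {
	p.refs[node]++
	delete(p.deathRow, node)
}

// dereference drops a link; a node whose counter hits zero is only marked
// deletable, not deleted right away.
func (p *pruner) dereference(node hash, block uint64) {
	p.refs[node]--
	if p.refs[node] <= 0 {
		p.deathRow[node] = block
	}
}

// flush deletes nodes that have stayed unreferenced for the retention period.
func (p *pruner) flush(db deleter, head uint64) {
	for node, since := range p.deathRow {
		if p.refs[node] <= 0 && head-since >= p.retention {
			db.Delete(node[:])
			delete(p.deathRow, node)
			delete(p.refs, node)
		}
	}
}
```

Note that nothing in this scheme records where a reference came from, which is exactly what makes the partial writes discussed below so damaging.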

The issue with this proposal was that the blog post itself is very confusing and very scarce on details, and it describes significant complexity that isn't warranted (special trie-node death rows, a journaling system for rollbacks and replays, trie resurrections, etc.).

Although the proposal could be simplified to work well in a reliable/perfect environment, the reference counters are extremely unstable in the face of crashes and/or power outages where only part of the trie's counters are synchronized. Since no metadata is maintained about the origin of a reference, a partial write will cripple the algorithm: depending on the order of the prune operations, a counter may be decremented twice or never if a crash occurs exactly between the deletion of a parent and the decrement of a counter.

Lesson learnt: it doesn't matter how the pruning algorithm works, but self-correction after a crash is an absolute necessity to prevent the accumulation of undeletable junk.

Discarded solution: Bloom filters

The second investigated solution was an idea based on Bloom filters, a probabilistic data structure that can answer whether a set probably contains, or surely doesn't contain, a specific element. A basic bloom filter has a limit on the number of elements it can handle, which needs to be approximately known a priori, making it unsuitable for cases with unknown set sizes. An extension - Scalable Bloom Filters - can however work around this issue by using partitioned filters that are constantly expanded whenever the previous batch reaches 0.5 saturation.

The idea of the algorithm was to traverse the entire state trie of the top N blocks and insert each of the trie node hashes into a scalable bloom filter. Afterwards we can scan the entire state database and discard any nodes not present in the filter. This would leave some percentage of dead nodes undeleted (the error ratio of the bloom filter). While the algorithm wouldn't have been a perfect solution, it was a self-correcting one, and it didn't require any extra storage in the database, only a runtime memory and processing cost to assemble the bloom filter and cross-check against it.
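As a rough illustration of the mark and sweep phases (the filter and iterator interfaces below are stand-ins for a concrete scalable bloom filter and the trie/database iterators, not existing geth APIs):

```go
// Sketch of the bloom-filter based mark and sweep; all interfaces are
// illustrative stand-ins, not existing geth or library APIs.
package bloomsweep

// filter is the subset of a (scalable) bloom filter's API the sweep needs.
type filter interface {
	Add(key []byte)       // remember a live trie node hash
	Test(key []byte) bool // false means the key is definitely not live
}

// trieIterator walks every node of a retained state trie.
type trieIterator interface {
	Next() bool
	Hash() []byte // database key of the current node
}

// dbIterator walks every entry of the state database and can delete them.
type dbIterator interface {
	Next() bool
	Key() []byte
	Delete() error
}

// mark inserts every node of the retained tries into the filter.
func mark(f filter, tries []trieIterator) {
	for _, it := range tries {
		for it.Next() {
			f.Add(it.Hash())
		}
	}
}

// sweep deletes every database entry the filter has definitely never seen;
// false positives (the filter's error ratio) survive as leftover junk.
func sweep(f filter, db dbIterator) {
	for db.Next() {
		if !f.Test(db.Key()) {
			db.Delete()
		}
	}
}
```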

We've put together a small test script to iterate a single state trie and assemble a bloom filter out of it.

The results were mostly a let down:

  • The test database was a live full node chain database of 5.5 GB (i.e. not pruned, not fast synced)
  • The head of the blockchain contained a state trie of ~1.6M nodes, 550K unique database entries
  • Iterating and inserting the state trie into a scalable bloom filter took ~24 seconds on a ZenBook Pro and consumed 7-12 MB of memory depending on the filter configuration

The takeaway is that even the current, relatively small state trie already required considerable resources, memory-wise but especially processing-wise. These costs would become more and more problematic on small devices with slow disk access, which wouldn't be able to keep up with the garbage collection requirements.

Lesson learnt: the pruning algorithm must be able to run online with the rest of the system, without inducing noticeable pauses; further, it's better to have a somewhat slower algorithm that can delete junk non-stop than one which needs an expensive upfront collection phase, which may be aborted before it can even delete a single obsolete entry.

Discarded solution: Graph databases

Based on the previous failed attempt with bloom filters, it's clear that we need to sacrifice some data storage in favor of maintaining trie metadata to aid in pruning; but based on the failed reference counting attempt it's also clear that we need a solution where the metadata is descriptive enough for self-correction.

The third pruning experiment was done by turning towards graph databases. The idea was that instead of storing the trie in a key/value store mapping hashes to node RLPs, we could extend the storage to also save all the relationship metadata between the trie nodes (and possibly other system data). With the trie represented as a graph (i.e. bidirectional edges between nodes) as opposed to a plain tree (only parent -> child traversals possible), pruning nodes would become trivial:

  • Whenever the last parent of a node is deleted, delete the node itself too
  • Whenever a node is deleted, delete all its links to children (a minimal sketch of this cascade follows the list)
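The sketch below shows that cascade over a hypothetical edge store; the interface is illustrative and not Cayley's actual API:

```go
// Illustrative cascade deletion over an explicit graph representation of the
// trie; the edge-store interface is hypothetical.
package graphprune

type edgeStore interface {
	Parents(node []byte) [][]byte  // incoming edges (who references this node)
	Children(node []byte) [][]byte // outgoing edges (who this node references)
	DeleteEdge(parent, child []byte)
	DeleteNode(node []byte)
}

// unlink removes the parent -> child edge and cascades the deletion downwards
// if the child just lost its last remaining parent.
func unlink(store edgeStore, parent, child []byte) {
	store.DeleteEdge(parent, child)
	if len(store.Parents(child)) == 0 {
		for _, grandchild := range store.Children(child) {
			unlink(store, child, grandchild)
		}
		store.DeleteNode(child)
	}
}
```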

Furthermore, this explicit graph representation lends itself well to self-correction, as it's easy to add an extra garbage collection run that scans through the database and prunes previous leftovers post crash, as all the needed relationship data is present and cannot be corrupted.

To verify our little experiment, we've tried to embed a state trie into the Cayley graph database, chosen primarily because it is written in pure Go (at least when using the leveldb and bolt backends) and is easily embeddable into our codebase. We've experimented with both the bolt and leveldb backends, but bolt turned out to perform very poorly, so we used leveldb for the remainder of our evaluation.

The first attempt embedded the entire state trie into Cayley (relationships + node RLP), which led to a database of about 1.2 GB in size. As it turned out, the leveldb backend in Cayley serializes all vertex contents into a JSON structure, which is very inappropriate for Ethereum, as most of the data is binary RLP encoded, so embedding it into JSON inflates it significantly.

A second attempt was also made to move only the trie relationships (i.e. graph edges) into Cayley, and use it as a metadata database to aid the primary chain database. While this did save a significant amount of space, the database was still ~450 MB in size, which warranted a bit of exploration as to how Cayley exactly represents and stores its graphs inside the leveldb backend:

The unit of storage in Cayley is a Quad: <src vertex> --<relationship>--> <dest vertex> <label>

  • Each of the data contents (src, dest, rel name, label) is stored as its hash -> json
  • For each quad, 4 index entries are stored to aid in traversal and queries
  • A journal is stored with all operation deltas (additions and deletions)
  • Deleted entries are only marked as such, but never actually removed

Long story short, Cayley stores a significant amount of metadata about its graphs that is required for the various exploration, query and analysis algorithms. Furthermore, Cayley was honed for textual data consisting mostly of relations. These two factors explain the rationale behind the various indexes; however, in the scope of Ethereum this amount of metadata is not affordable.

Lesson learnt: although graph databases provide valuable insight into the various indexing methods that allow implementing fast graph traversals and queries, deploying a full blown graph database on top of our already large dataset is too heavyweight.

Chosen solution: Child → parent index

The solution we ended up with is the introduction of a child→parent state trie node relationship index: whenever a node references an existing node in the database, a <child hash><parent hash> entry is added to the index. This single index enables the upwards exploration of the state trie, allowing a contextual reference count where we know not only whether the counter reaches zero, but also the origins of the references, which makes it possible to correct aborted pruning runs.
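For illustration, here is a sketch of how such an index could be maintained and queried, assuming geth's common.Hash and ethdb.Database types; the key layout and helper names are assumptions for the sketch, not the PR's final on-disk format:

```go
// Sketch of the child -> parent index; key layout and helper names are
// illustrative, not the PR's final format.
package stateindex

import (
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethdb"
)

// indexKey builds the <child hash><parent hash> index key, so all parents of
// a child can be enumerated with a prefix scan over the child hash.
func indexKey(child, parent common.Hash) []byte {
	key := make([]byte, 0, 2*common.HashLength)
	key = append(key, child.Bytes()...)
	key = append(key, parent.Bytes()...)
	return key
}

// reference records that parent links to child.
func reference(db ethdb.Database, child, parent common.Hash) error {
	return db.Put(indexKey(child, parent), nil)
}

// dereference removes the parent -> child link.
func dereference(db ethdb.Database, child, parent common.Hash) error {
	return db.Delete(indexKey(child, parent))
}

// prefixScanner is a stand-in for whatever prefix iteration the key/value
// backend provides.
type prefixScanner interface {
	KeysWithPrefix(prefix []byte) [][]byte
}

// parentsOf returns the origins of every reference to child: an empty result
// means the node is unreferenced and safe to prune, a non-empty one tells
// exactly which parents still hold it.
func parentsOf(db prefixScanner, child common.Hash) []common.Hash {
	var parents []common.Hash
	for _, key := range db.KeysWithPrefix(child.Bytes()) {
		parents = append(parents, common.BytesToHash(key[common.HashLength:]))
	}
	return parents
}
```

Because the index stores the parent hashes themselves rather than a bare counter, an interrupted pruning run can always be resumed or corrected by re-scanning the <child hash> prefix.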

Evaluation to be continued...

Stale fork pruning

To be designed...


@codecov-io

Current coverage is 44.58%

Merging #2111 into develop will increase coverage by +0.18% as of b706456


if !it.dataIt.Next() {
	it.dataIt = nil
}
if bytes.Compare(account.CodeHash, emptyCodeHash) != 0 {
Contributor

Please use bytes.Equal. You'll never need the >, < operators.
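For reference, the suggested form of that condition would read roughly as follows (a fragment matching the quoted hunk, not the final diff):

```go
// bytes.Equal expresses the intended equality check directly; bytes.Compare
// is only needed when an ordering (<, >) between the slices matters.
if !bytes.Equal(account.CodeHash, emptyCodeHash) {
```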

@karalabe (Member, Author)

This is really part of the trie iterator PR. I'll leave it for that to include the fix. I've pulled it in here to be able to test the indexing that needs a way to properly iterate over all dependent nodes, but it would be nice to have the iterator merged in independently as it's mostly unrelated code and would just burden the probably way too complex review of this PR.

@karalabe added this to the 1.5.0 milestone Feb 19, 2016
@vbuterin (Contributor)

There is another approach that is not very nice computer-science-theory-wise, but quite pragmatic: periodic sweeps. Basically, every N days, go through every account in the most recent X blocks, descend every state tree, and mark every node as current. Then sweep the DB and delete all non-current nodes.
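A rough sketch of that periodic mark-and-sweep, with placeholder interfaces rather than actual geth APIs:

```go
// Periodic mark-and-sweep sketch; the store and iterator interfaces are
// placeholders, not geth APIs.
package sweeper

// trieWalker walks every node of one retained state trie.
type trieWalker interface {
	Next() bool
	Hash() []byte
}

// stateStore exposes the pieces of the state database the sweep needs.
type stateStore interface {
	RecentTries(blocks int) []trieWalker // tries of the most recent X blocks
	Keys() [][]byte                      // every key in the state database
	Delete(key []byte) error
}

// sweep marks every node reachable from the retained tries as current, then
// deletes everything that was not marked.
func sweep(db stateStore, retainBlocks int) {
	current := make(map[string]struct{})
	for _, it := range db.RecentTries(retainBlocks) {
		for it.Next() {
			current[string(it.Hash())] = struct{}{} // mark
		}
	}
	for _, key := range db.Keys() {
		if _, ok := current[string(key)]; !ok {
			db.Delete(key) // sweep
		}
	}
}
```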

@fjl (Contributor) commented Oct 19, 2016

@vbuterin we tried that and it's too slow for normal use. It's implemented as a batch-style command in #2489.

@fjl (Contributor) commented Oct 28, 2016

@karalabe I'm closing this because it's almost one year old and we will likely do pruning in a different way.
