Garbage collector #2111

Closed · wants to merge 8 commits

Conversation

@karalabe (Member) commented Jan 8, 2016

Geth garbage collection

Currently the database storing the Go Ethereum blockchain is around 5.5 GB for a normal, only sparingly running node (i.e. my develop instance). Besides all the live information that the node actually requires to function, there are various leftovers from past transaction processing and micro forks:

  • Contract state entries that are no longer relevant (contract state changed since)
  • Global state trie nodes that are no longer referenced by the current blockchain
  • Headers, blocks and receipts that were in a small fork, but overruled later

All of the above leftovers significantly increase the size of the database, but none of them are of actual value (unless you want to inspect past state). If we do a fast sync, gathering only the most essential data required to run the node, the database weighs around 1 GB, meaning that for an average node we have a junk multiplier of about 5x; for miners, who regularly generate uncles and mini forks, this is probably much higher.

As the biggest bottleneck we currently have is disk IO, it's essential that we keep the database size down to the minimum possible. The most obvious step to achieve this at the moment is to introduce garbage collection into Geth that will constantly discard useless data.

This entails deleting forked data (i.e. chain elements that were overruled) and out-of-date state data (i.e. state trie nodes not part of the current top N blocks). In the next sections I'll describe a set of garbage collection algorithms (along with the database modifications required to enable them) that should be able to keep the accumulated junk down, while at the same time being self-correcting (junk should eventually disappear, even after a crashed node or an aborted GC run).

Note: the algorithms are designed with huge datasets in mind. If the chain/state database can fit into memory, even relatively naive algorithms can perform reasonably well, but our aim is to make these work as seamlessly as possible in terms of CPU, memory and disk IO.

State trie pruning

Discarded solution: Reference counting

The first investigated solution was the one proposed by Vitalik in his State Trie Pruning blog post.

The exact internal data representation was not clearly defined, but the solution entailed keeping a counter for each node in the database with the number of entities referencing it, marking the trie node as "deletable" if the reference reaches zero. After some number of blocks pass, nodes marked deletable and still not referenced are dropped.
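To make the idea concrete, here is a minimal sketch of how such counter bookkeeping could look; the type names, the deleter interface and the retention policy are illustrative assumptions, not the design from the blog post:

```go
// Minimal reference-counting sketch; all names and the retention policy are
// illustrative assumptions, not the design from the blog post.
package refcount

type hash [32]byte

// deleter is the only database capability the sketch needs.
type deleter interface {
	Delete(key []byte) error
}

type pruner struct {
	refs      map[hash]int    // live reference count per trie node
	deathRow  map[hash]uint64 // block number at which a node became unreferenced
	retention uint64          // blocks to wait before actually deleting
}

func newPruner(retention uint64) *pruner {
	return &pruner{
		refs:      make(map[hash]int),
		deathRow:  make(map[hash]uint64),
		retention: retention,
	}
}

// reference notes a new link to a node, resurrecting it if it was on death row.
func (p *pruner) reference(node hash) {
	p.refs[node]++
	delete(p.deathRow, node)
}

// dereference drops a link; a node whose counter hits zero is only marked
// deletable, not deleted right away.
func (p *pruner) dereference(node hash, block uint64) {
	p.refs[node]--
	if p.refs[node] <= 0 {
		p.deathRow[node] = block
	}
}

// flush deletes nodes that have stayed unreferenced for the retention period.
func (p *pruner) flush(db deleter, head uint64) {
	for node, since := range p.deathRow {
		if p.refs[node] <= 0 && head-since >= p.retention {
			db.Delete(node[:])
			delete(p.deathRow, node)
			delete(p.refs, node)
		}
	}
}
```

Note that nothing in this scheme records where a reference came from, which is exactly what makes the partial writes discussed below so damaging.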

The issue with this proposal was that the blog post itself is very confusing and very scarce on details, and it describes significant complexity that isn't warranted (special trie-node death rows, a journaling system for rollbacks and replays, trie resurrections, etc.).

Although the proposal could be simplified to work well in a reliable/perfect environment, the reference counters are extremely unstable in the face of crashes and/or power outages where only part of the trie's counters are synchronized. Since no metadata is maintained about the origin of a reference, a partial write will cripple the algorithm: depending on the order of the prune operations, a counter may be decremented twice or never if a crash occurs exactly between the deletion of a parent and the decrement of a counter.

Lesson learnt: it doesn't matter how the pruning algorithm works, but self-correction after a crash is an absolute necessity to prevent the accumulation of undeletable junk.

Discarded solution: Bloom filters

The second investigated solution was an idea based on Bloom filters, a probabilistic data structure that can answer whether a set probably contains, or surely doesn't contain, a specific element. A basic bloom filter has a limit on the number of elements it can handle, which needs to be approximately known a priori, making it unsuitable for cases with unknown set sizes. An extension - Scalable Bloom Filters - can however work around this issue by using partitioned filters that are constantly expanded whenever the previous batch reaches 0.5 saturation.

The idea of the algorithm was to traverse the entire state trie of the top N blocks and insert each of the trie node hashes into a scalable bloom filter. Afterwards we can scan the entire state database and discard any nodes not present in the filter. This would leave some percentage of dead nodes undeleted (the error ratio of the bloom filter). While the algorithm wouldn't have been a perfect solution, it was a self-correcting one, and it didn't require any extra storage in the database, only a runtime memory and processing cost to assemble the bloom filter and cross-check against it.
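As a rough illustration of the mark and sweep phases (the filter and iterator interfaces below are stand-ins for a concrete scalable bloom filter and the trie/database iterators, not existing geth APIs):

```go
// Sketch of the bloom-filter based mark and sweep; all interfaces are
// illustrative stand-ins, not existing geth or library APIs.
package bloomsweep

// filter is the subset of a (scalable) bloom filter's API the sweep needs.
type filter interface {
	Add(key []byte)       // remember a live trie node hash
	Test(key []byte) bool // false means the key is definitely not live
}

// trieIterator walks every node of a retained state trie.
type trieIterator interface {
	Next() bool
	Hash() []byte // database key of the current node
}

// dbIterator walks every entry of the state database and can delete them.
type dbIterator interface {
	Next() bool
	Key() []byte
	Delete() error
}

// mark inserts every node of the retained tries into the filter.
func mark(f filter, tries []trieIterator) {
	for _, it := range tries {
		for it.Next() {
			f.Add(it.Hash())
		}
	}
}

// sweep deletes every database entry the filter has definitely never seen;
// false positives (the filter's error ratio) survive as leftover junk.
func sweep(f filter, db dbIterator) {
	for db.Next() {
		if !f.Test(db.Key()) {
			db.Delete()
		}
	}
}
```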

We've put together a small test script to iterate a single state trie and assemble a bloom filter out of it.

The results were mostly a let down:

  • The test database was a live full node chain database of 5.5 GB (i.e. not pruned, not fast synced)
  • The head of the blockchain contained a state trie of ~1.6M nodes, 550K unique database entries
  • Iterating and inserting the state trie into a scalable bloom filter took ~24 seconds on a ZenBook Pro and consumed 7-12 MB of memory depending on the filter configuration

The takeaway is that even the current, relatively small state trie already required considerable resources, memory-wise but especially processing-wise. These costs would become more and more problematic on small devices with slow disk access, which wouldn't be able to keep up with the garbage collection requirements.

Lesson learnt: the pruning algorithm must be able to run online with the rest of the system, without inducing noticeable pauses; further, it's better to have a somewhat slower algorithm that can delete junk non-stop than one which needs an expensive upfront collection phase, which may be aborted before it can even delete a single obsolete entry.

Discarded solution: Graph databases

Based on the previous failed attempt with bloom filters, it's clear that we need to sacrifice some data storage in favor of maintaining trie metadata to aid in pruning; but based on the failed reference counting attempt it's also clear that we need a solution where the metadata is descriptive enough for self-correction.

The third pruning experiment was done by turning towards graph databases. The idea was that instead of storing the trie in a key/value store mapping hashes to node RLPs, we could extend the storage to also save all the relationship metadata between the trie nodes (and possibly other system data). With the trie represented as a graph (i.e. bidirectional edges between nodes) as opposed to a plain tree (only parent -> child traversals possible), pruning nodes would become trivial:

  • Whenever the last parent of a node is deleted, delete the node itself too
  • Whenever a node is deleted, delete all its links to children (a minimal sketch of this cascade follows the list)
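The sketch below shows that cascade over a hypothetical edge store; the interface is illustrative and not Cayley's actual API:

```go
// Illustrative cascade deletion over an explicit graph representation of the
// trie; the edge-store interface is hypothetical.
package graphprune

type edgeStore interface {
	Parents(node []byte) [][]byte  // incoming edges (who references this node)
	Children(node []byte) [][]byte // outgoing edges (who this node references)
	DeleteEdge(parent, child []byte)
	DeleteNode(node []byte)
}

// unlink removes the parent -> child edge and cascades the deletion downwards
// if the child just lost its last remaining parent.
func unlink(store edgeStore, parent, child []byte) {
	store.DeleteEdge(parent, child)
	if len(store.Parents(child)) == 0 {
		for _, grandchild := range store.Children(child) {
			unlink(store, child, grandchild)
		}
		store.DeleteNode(child)
	}
}
```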

Furthermore, this explicit graph representation lends itself well to self-correction, as it's easy to add an extra garbage collection run that scans through the database and prunes previous leftovers post crash, as all the needed relationship data is present and cannot be corrupted.

To verify our little experiment, we've tried to embed a state trie into the Cayley graph database, chosen primarily because it is written in pure Go (at least when using the leveldb and bolt backends) and is easily embeddable into our codebase. We've experimented with both the bolt and leveldb backends, but bolt turned out to perform very poorly, so we used leveldb for the remainder of our evaluation.

The first attempt embedded the entire state trie into Cayley (relationships + node RLP), which led to a database of about 1.2 GB in size. As it turned out, the leveldb backend in Cayley serializes all vertex contents into a JSON structure, which is very inappropriate for Ethereum, as most of the data is binary RLP encoded, so embedding it into JSON inflates it significantly.

A second attempt was also made to move only the trie relationships (i.e. graph edges) into Cayley, and use it as a metadata database to aid the primary chain database. While this did save a significant amount of space, the database was still ~450 MB in size, which warranted a bit of exploration as to how Cayley exactly represents and stores its graphs inside the leveldb backend:

The unit of storage in Cayley is a Quad: <src vertex> --<relationship>--> <dest vertex> <label>

  • Each of the data contents (src, dest, rel name, label) is stored as its hash -> json
  • For each quad, 4 index entries are stored to aid in traversal and queries
  • A journal is stored with all operation deltas (additions and deletions)
  • Deleted entries are only marked as such, but never actually removed

Long story short, Cayley stores a significant amount of metadata about its graphs that is required for the various exploration, query and analysis algorithms. Furthermore, Cayley was honed for textual data consisting mostly of relations. These two factors explain the rationale behind the various indexes; however, in the scope of Ethereum this amount of metadata is not affordable.

Lesson learnt: although graph databases provide valuable insight into the various indexing methods that allow implementing fast graph traversals and queries, deploying a full blown graph database on top of our already large dataset is too heavyweight.

Chosen solution: Child → parent index

The solution we ended up with is the introduction of a child→parent state trie node relationship index: whenever a node references an existing node in the database, a <child hash><parent hash> entry is added to the index. This single index enables the upwards exploration of the state trie, allowing a contextual reference count where we know not only whether the counter reaches zero, but also the origins of the references, which makes it possible to correct aborted pruning runs.
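For illustration, here is a sketch of how such an index could be maintained and queried, assuming geth's common.Hash and ethdb.Database types; the key layout and helper names are assumptions for the sketch, not the PR's final on-disk format:

```go
// Sketch of the child -> parent index; key layout and helper names are
// illustrative, not the PR's final format.
package stateindex

import (
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethdb"
)

// indexKey builds the <child hash><parent hash> index key, so all parents of
// a child can be enumerated with a prefix scan over the child hash.
func indexKey(child, parent common.Hash) []byte {
	key := make([]byte, 0, 2*common.HashLength)
	key = append(key, child.Bytes()...)
	key = append(key, parent.Bytes()...)
	return key
}

// reference records that parent links to child.
func reference(db ethdb.Database, child, parent common.Hash) error {
	return db.Put(indexKey(child, parent), nil)
}

// dereference removes the parent -> child link.
func dereference(db ethdb.Database, child, parent common.Hash) error {
	return db.Delete(indexKey(child, parent))
}

// prefixScanner is a stand-in for whatever prefix iteration the key/value
// backend provides.
type prefixScanner interface {
	KeysWithPrefix(prefix []byte) [][]byte
}

// parentsOf returns the origins of every reference to child: an empty result
// means the node is unreferenced and safe to prune, a non-empty one tells
// exactly which parents still hold it.
func parentsOf(db prefixScanner, child common.Hash) []common.Hash {
	var parents []common.Hash
	for _, key := range db.KeysWithPrefix(child.Bytes()) {
		parents = append(parents, common.BytesToHash(key[common.HashLength:]))
	}
	return parents
}
```

Because the index stores the parent hashes themselves rather than a bare counter, an interrupted pruning run can always be resumed or corrected by re-scanning the <child hash> prefix.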

Evaluation to be continued...

Stale fork pruning

To be designed...


@codecov-io

Current coverage is 44.58%

Merging #2111 into develop will increase coverage by +0.18% as of b706456


if !it.dataIt.Next() {
	it.dataIt = nil
}
if bytes.Compare(account.CodeHash, emptyCodeHash) != 0 {
Contributor

Please use bytes.Equal. You'll never need the >, < operators.
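For reference, the suggested form of that condition would read roughly as follows (a fragment matching the quoted hunk, not the final diff):

```go
// bytes.Equal expresses the intended equality check directly; bytes.Compare
// is only needed when an ordering (<, >) between the slices matters.
if !bytes.Equal(account.CodeHash, emptyCodeHash) {
```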

@karalabe (Member, Author)

This is really part of the trie iterator PR. I'll leave it for that to include the fix. I've pulled it in here to be able to test the indexing that needs a way to properly iterate over all dependent nodes, but it would be nice to have the iterator merged in independently as it's mostly unrelated code and would just burden the probably way too complex review of this PR.

@karalabe added this to the 1.5.0 milestone Feb 19, 2016
@vbuterin (Contributor)

There is another approach that is not very nice computer-science-theory-wise, but quite pragmatic: periodic sweeps. Basically, every N days, go through every account in the most recent X blocks, descend every state tree, and mark every node as current. Then sweep the DB and delete all non-current nodes.
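A rough sketch of that periodic mark-and-sweep, with placeholder interfaces rather than actual geth APIs:

```go
// Periodic mark-and-sweep sketch; the store and iterator interfaces are
// placeholders, not geth APIs.
package sweeper

// trieWalker walks every node of one retained state trie.
type trieWalker interface {
	Next() bool
	Hash() []byte
}

// stateStore exposes the pieces of the state database the sweep needs.
type stateStore interface {
	RecentTries(blocks int) []trieWalker // tries of the most recent X blocks
	Keys() [][]byte                      // every key in the state database
	Delete(key []byte) error
}

// sweep marks every node reachable from the retained tries as current, then
// deletes everything that was not marked.
func sweep(db stateStore, retainBlocks int) {
	current := make(map[string]struct{})
	for _, it := range db.RecentTries(retainBlocks) {
		for it.Next() {
			current[string(it.Hash())] = struct{}{} // mark
		}
	}
	for _, key := range db.Keys() {
		if _, ok := current[string(key)]; !ok {
			db.Delete(key) // sweep
		}
	}
}
```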

@fjl (Contributor) commented Oct 19, 2016

@vbuterin we tried that and it's too slow for normal use. It's implemented as a batch-style command in #2489.

@fjl (Contributor) commented Oct 28, 2016

@karalabe I'm closing this because it's almost one year old and we will likely do pruning in a different way.
