Garbage collector #2111
if !it.dataIt.Next() {
	it.dataIt = nil
}
if bytes.Compare(account.CodeHash, emptyCodeHash) != 0 {
Please use bytes.Equal. You'll never need the >, < operators.
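For reference, a minimal standalone illustration of the suggested form (the byte slices below are placeholders, not the real code hashes):

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	// Placeholder values standing in for account.CodeHash and emptyCodeHash.
	codeHash := []byte{0x01, 0x02}
	emptyCodeHash := []byte{0x01, 0x02}

	// Instead of bytes.Compare(codeHash, emptyCodeHash) != 0, the equality
	// check reads directly as a boolean:
	if !bytes.Equal(codeHash, emptyCodeHash) {
		fmt.Println("account has code")
	} else {
		fmt.Println("account has no code")
	}
}
```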
This is really part of the trie iterator PR; I'll leave it for that one to include the fix. I've pulled it in here to be able to test the indexing, which needs a way to properly iterate over all dependent nodes, but it would be nice to have the iterator merged in independently, as it's mostly unrelated code and would just burden the probably way too complex review of this PR.
There is another approach that is not very nice computer-science-theory-wise but quite pragmatic: periodic sweeps. Basically, every N days, go through every account in the most recent X blocks, descend every state tree, and mark every node as current. Then sweep the DB and delete all non-current nodes.
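A rough sketch of how such a periodic mark-and-sweep could look (in-memory maps stand in for the real chain/state database, and the node layout is an assumption for illustration only):

```go
package main

import "fmt"

// node is a simplified stand-in for a state trie node: it only records the
// hashes of its children.
type node struct {
	children []string
}

// markTrie walks the trie rooted at root and records every reachable node
// hash in the current set.
func markTrie(db map[string]node, root string, current map[string]bool) {
	if current[root] {
		return
	}
	n, ok := db[root]
	if !ok {
		return
	}
	current[root] = true
	for _, child := range n.children {
		markTrie(db, child, current)
	}
}

// sweep deletes every node from the database that was not marked current.
func sweep(db map[string]node, current map[string]bool) {
	for hash := range db {
		if !current[hash] {
			delete(db, hash)
		}
	}
}

func main() {
	// Toy database: root "a" references "b"; "c" is an orphaned leftover.
	db := map[string]node{
		"a": {children: []string{"b"}},
		"b": {},
		"c": {},
	}
	current := make(map[string]bool)

	// Mark phase: descend the state roots of the most recent X blocks.
	for _, root := range []string{"a"} {
		markTrie(db, root, current)
	}
	// Sweep phase: drop everything that wasn't marked.
	sweep(db, current)
	fmt.Println(len(db)) // 2: "a" and "b" survive, "c" is gone
}
```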
@karalabe I'm closing this because it's almost one year old and we will likely do pruning in a different way.
Geth garbage collection
Currently the database storing the Go Ethereum blockchain is around 5.5 GB for a normal, sparingly running node (i.e. my develop instance). Besides all the live information that the node actually requires to function, there are various leftovers from past transaction processing and micro forks:
All of the above leftovers significantly increase the size of the database, but none of them are of actual value (unless you want to inspect past state). If we do a fast sync, gathering only the most essential data required to run the node, the database weighs around 1 GB, meaning that for an average node we have a junk multiplier of about 5x; for miners, who regularly generate uncles and mini forks, this is probably much higher.
As disk IO is currently our biggest bottleneck, it's essential to keep the database size down to the minimum possible. The most obvious step to achieve this is to introduce garbage collection into Geth that constantly discards useless data.
This entails deleting forked data (i.e. chain elements that were overruled) and out-of-date state data (i.e. state trie nodes not part of the current top N blocks). In the next sections I'll describe a set of garbage collection algorithms (along with the database modifications required to enable them) that should be able to keep the accumulated junk down, while at the same time being self-correcting (junk should eventually disappear, even after a crashed node or an aborted GC run).
Note: the algorithms are designed with huge datasets in mind. If the chain/state database can fit into memory, even relatively naive algorithms can perform reasonably well, but our aim is to make these work as seamlessly as possible CPU-, memory- and disk-IO-wise.
State trie pruning
Discarded solution: Reference counting
The first investigated solution was the one proposed by Vitalik in his State Trie Pruning blog post.
The exact internal data representation was not clearly defined, but the solution entailed keeping a counter for each node in the database with the number of entities referencing it, marking the trie node as "deletable" once the reference count reaches zero. After some number of blocks pass, nodes marked deletable and still not referenced are dropped.
The issue with this proposal was that the blog post itself is very confusing, very scarce on details and describes significant complexity that isn't warranted (special trie-node death rows, a journaling system for rollbacks and replays, trie resurrections, etc.).
Although the proposal could be simplified to work well in a reliable/perfect environment, the reference counters are extremely unstable in the face of crashes and/or power outages where only part of the trie's counters are synchronized. Since no metadata is maintained about the origin of a reference, a partial write will cripple the algorithm: depending on the order of the prune operations, a counter may be decremented twice or never if a crash occurs exactly between the deletion of a parent and the decrement of a counter.
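For illustration, a minimal in-memory sketch of the reference-counting idea and of why it is fragile (the structure and names below are assumptions, not the representation from the blog post):

```go
package main

import "fmt"

// refDB is a toy node store keeping a reference counter next to each node.
type refDB struct {
	refs map[string]int // node hash -> number of referencing parents
}

// reference is called whenever a newly written node points at an existing one.
func (db *refDB) reference(child string) {
	db.refs[child]++
}

// dereference is called when a parent is deleted; once the counter hits zero
// the child becomes deletable and its own children are dereferenced in turn.
func (db *refDB) dereference(child string, children map[string][]string) {
	db.refs[child]--
	if db.refs[child] <= 0 {
		delete(db.refs, child)
		for _, grandchild := range children[child] {
			db.dereference(grandchild, children)
		}
	}
}

func main() {
	db := &refDB{refs: map[string]int{}}
	children := map[string][]string{"parent": {"shared"}}

	// Two tries end up referencing the same "shared" node.
	db.reference("parent")
	db.reference("shared")
	db.reference("shared")

	// Deleting the parent decrements the shared counter; the node survives.
	db.dereference("parent", children)
	fmt.Println(db.refs["shared"]) // 1

	// The fragility: if a crash lands between deleting the parent and
	// persisting the decremented counter, replaying the deletion decrements
	// "shared" twice (or never), and nothing in the database records where
	// the remaining reference came from.
}
```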
Lesson learnt: it doesn't matter how the pruning algorithm works, but self-correction after a crash is an absolute necessity to prevent the accumulation of undeletable junk.
Discarded solution: Bloom filters
The second investigated solution was an idea based on Bloom Filters, a probabilistic data structure that can answer whether a set probably contains or surely doesn't contain a specific element. A basic bloom filter has a limit on the number of elements it can handle, which needs to be approximately known a priori, making it unsuitable for cases with unknown set sizes. An extension - Scalable Bloom Filters - can however work around this issue by using partitioned filters that are constantly expanded whenever the previous batch reaches 0.5 saturation.
The algorithm idea was to traverse the entire state trie of the top N blocks and insert each of the trie node hashes into a scalable bloom filter. Afterwards we can scan the entire state database and discard any nodes not present in the filter. This would leave some percentage of dead nodes not deleted (the error ratio of the bloom filter). While the algorithm wouldn't have been a perfect solution, it was a self-correcting one and didn't require any extra storage in the database, only a runtime memory and processing cost to assemble the bloom filter and cross-check.
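A rough sketch of the mark/cross-check idea using a minimal fixed-size bloom filter (a real run would need a scalable variant and the actual trie iterator; everything below is a simplification for illustration):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal fixed-size Bloom filter. A production run would need a
// scalable variant, since the number of trie nodes isn't known up front.
type bloom struct {
	bits []bool
	k    uint64
}

func newBloom(m int, k uint64) *bloom { return &bloom{bits: make([]bool, m), k: k} }

// indexes derives k bit positions from a single FNV hash; good enough for a sketch.
func (b *bloom) indexes(key string) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(key))
	h1 := h.Sum64()
	idx := make([]uint64, b.k)
	for i := uint64(0); i < b.k; i++ {
		idx[i] = (h1 + i*0x9e3779b97f4a7c15) % uint64(len(b.bits))
	}
	return idx
}

func (b *bloom) add(key string) {
	for _, i := range b.indexes(key) {
		b.bits[i] = true
	}
}

func (b *bloom) maybeContains(key string) bool {
	for _, i := range b.indexes(key) {
		if !b.bits[i] {
			return false
		}
	}
	return true
}

func main() {
	// Mark phase: insert every node hash reachable from the recent state roots.
	live := []string{"node-a", "node-b", "node-c"}
	filter := newBloom(1024, 4)
	for _, hash := range live {
		filter.add(hash)
	}

	// Sweep phase: scan the whole node database and drop anything the filter
	// definitely does not contain; false positives are simply left behind.
	database := []string{"node-a", "node-b", "node-c", "stale-1", "stale-2"}
	for _, hash := range database {
		if !filter.maybeContains(hash) {
			fmt.Println("delete", hash)
		}
	}
}
```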
We've put together a small test script to iterate a single state trie and assemble a bloom filter out of it.
The results were mostly a letdown:
The takeaway is that even the current, relatively small state trie already took considerable resources, memory-wise but especially processing-wise. These would become more and more problematic on small devices with slow disk access, which won't be able to keep up with the garbage collection requirements.
Lesson learnt: the pruning algorithm must be able to run online with the rest of the system, without inducing noticeable pauses; further, it's better to have a somewhat slower algorithm that can delete junk non-stop than one which needs an expensive upfront collection phase, which may be aborted before it can even delete a single obsolete entry.
Discarded solution: Graph databases
Based on the previous failed attempt with bloom filters, it's clear that we need to sacrifice some data storage in favor of maintaining trie metadata to aid in pruning; but based on the failed reference counting attempt it's also clear we need a solution where the metadata is descriptive enough for self-correction.
The third pruning experiment was done by turning towards graph databases. The idea was that instead of storing the trie in a key/value store mapping hashes to node RLPs, we could extend the storage to also save all the relationship metadata between the trie nodes (and possibly other system data). With the trie represented as a graph (i.e. bidirectional edges between nodes) as opposed to a plain tree (only parent -> child traversals possible), it becomes trivial to prune nodes:
Furthermore, this explicit graph representation lends itself well to self-correction, as it's easy to add an extra garbage collection run that scans through the database and prunes previous leftovers after a crash, since all the needed relationship data is present and cannot be corrupted.
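A minimal in-memory sketch of why bidirectional edges make pruning trivial (Go maps stand in for the graph store; the names are illustrative only): a node can be dropped as soon as its parent set becomes empty, and the check cascades to its children.

```go
package main

import "fmt"

// graph keeps trie relationships in both directions, so a node's referrers
// can be looked up directly instead of being re-derived from the tries.
type graph struct {
	parents  map[string]map[string]bool // child hash -> set of parent hashes
	children map[string][]string        // parent hash -> child hashes
}

// unlink removes the parent->child edge and prunes the child (cascading to
// its descendants) once no parents remain.
func (g *graph) unlink(parent, child string) {
	delete(g.parents[child], parent)
	if len(g.parents[child]) == 0 {
		delete(g.parents, child)
		for _, grandchild := range g.children[child] {
			g.unlink(child, grandchild)
		}
		delete(g.children, child)
		fmt.Println("pruned", child)
	}
}

func main() {
	g := &graph{
		parents: map[string]map[string]bool{
			"leaf": {"rootA": true, "rootB": true},
		},
		children: map[string][]string{
			"rootA": {"leaf"},
			"rootB": {"leaf"},
		},
	}
	g.unlink("rootA", "leaf") // still referenced by rootB, nothing pruned
	g.unlink("rootB", "leaf") // last reference gone -> "pruned leaf"
}
```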
To verify our little experiment, we've tried to embed a state trie into the Cayley graph database, chosen primarily because it is written in pure Go (at least when using the leveldb and bolt backends) and is easily embeddable into our codebase. We've experimented with both the bolt and leveldb backends, but bolt turned out to perform very poorly, so we used leveldb for the remainder of our evaluation.
The first attempt embedded the entire state trie into Cayley (relationships + node RLP), which led to a database of about 1.2 GB in size. As it turned out, the leveldb backend in Cayley serializes all vertex contents into a JSON structure, which is very inappropriate for Ethereum: most of the data is binary RLP encoded, so embedding it into JSON inflates it significantly.
A second attempt was also made to move only the trie relationships (i.e. graph edges) into Cayley, and use it as a metadata database to aid the primary chain database. While this did save a significant amount of space, the database was still ~450 MB in size, which warranted a bit of exploration as to how Cayley exactly represents and stores its graphs inside the leveldb backend:
The unit of storage in Cayley is a Quad, i.e. an edge of the form

<src vertex> --<relationship>--> <dest vertex>

plus a <label>, with each vertex itself stored as a hash -> json entry.
Long story short, Cayley stores a significant amount of metadata about its graphs that is required for the various exploration, query and analysis algorithms. Furthermore, Cayley was honed for textual data consisting mostly of relations. These two facts explain the need for the various indexes, but in the scope of Ethereum this amount of metadata is not affordable.
Lesson learnt: although graph databases provide valuable insight into the various indexing methods that allow implementing fast graph traversals and queries, deploying a full-blown graph database on top of our already large dataset is too heavyweight.
Chosen solution: Child → parent index
The solution we ended up with is the introduction of a child → parent state trie node relationship index: whenever a node references an existing node in the database, a

<child hash><parent hash>

entry is added to the index. This single index enables upward exploration of the state trie, allowing contextual reference counting, where we know not only whether the counter reaches zero, but also the origins of the counter, enabling the correction of aborted pruning runs.
Evaluation to be continued...
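In the meantime, a rough sketch of how such an index could be maintained on top of a key/value store (the key layout, prefix scan and helper names below are assumptions for illustration, not the actual implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// kv is a stand-in for the on-disk key/value store (e.g. LevelDB).
type kv map[string]string

// indexKey builds the <child hash><parent hash> key of the child->parent index.
func indexKey(child, parent string) string { return "idx:" + child + ":" + parent }

// reference records that parent references child.
func reference(db kv, child, parent string) { db[indexKey(child, parent)] = "" }

// parentsOf enumerates all parents of child via a key-prefix scan. Because the
// origin of every reference is stored, an aborted pruning run can simply be
// re-run: the index answers both "how many referrers" and "which ones".
func parentsOf(db kv, child string) []string {
	prefix := "idx:" + child + ":"
	var parents []string
	for key := range db {
		if strings.HasPrefix(key, prefix) {
			parents = append(parents, strings.TrimPrefix(key, prefix))
		}
	}
	return parents
}

// dereference removes the parent->child link and reports whether the child is
// now unreferenced and thus safe to delete.
func dereference(db kv, child, parent string) bool {
	delete(db, indexKey(child, parent))
	return len(parentsOf(db, child)) == 0
}

func main() {
	db := kv{}
	reference(db, "leaf", "rootA")
	reference(db, "leaf", "rootB")

	fmt.Println(parentsOf(db, "leaf"))            // two referrers, known by origin
	fmt.Println(dereference(db, "leaf", "rootA")) // false: rootB still references it
	fmt.Println(dereference(db, "leaf", "rootB")) // true: leaf can now be pruned
}
```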
Stale fork pruning
To be designed...