Reconsider database key for trie node #4527
The source of this consideration is the fact that if we only needed to split a Merkle trie or tree that is aligned by the account_id (or its hash), that would be easy to do in a single walk down the trie/tree. Generally, ref counting is a source of huge problems across the stack (including performance), and it has made the storage code extremely complex. For context, the reason for ref counting is that two account records with the same information (e.g. exactly the same amount of money, the same nonce, and no contract) hash to the same key. Two proposals were made to address this issue; they are described in the comments below.
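For illustration, here is a minimal sketch of what refcounted node storage amounts to (hypothetical names and structure, not nearcore's actual storage code): since two identical account records hash to the same key, inserting bumps a counter and deleting must decrement it rather than remove the node outright.

```rust
use std::collections::HashMap;

// Hypothetical refcounted node store: node hash -> (serialized node, refcount).
#[derive(Default)]
struct RefcountedNodeStore {
    nodes: HashMap<[u8; 32], (Vec<u8>, u64)>,
}

impl RefcountedNodeStore {
    fn insert(&mut self, hash: [u8; 32], data: Vec<u8>) {
        // Two accounts with identical state serialize to identical bytes and
        // therefore the same hash; the second insert only bumps the count.
        let entry = self.nodes.entry(hash).or_insert((data, 0));
        entry.1 += 1;
    }

    fn remove(&mut self, hash: [u8; 32]) {
        // The node can only be physically deleted when its last reference is
        // gone, which is the bookkeeping that complicates the storage code.
        let delete = match self.nodes.get_mut(&hash) {
            Some(entry) => {
                entry.1 -= 1;
                entry.1 == 0
            }
            None => false,
        };
        if delete {
            self.nodes.remove(&hash);
        }
    }
}
```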
Also, given that this reconsideration will require database re-work, we should reconsider the account layout at the same time. Currently account info is split across a few prefixes as well (e.g. 0x0 + account_id for account info, 0x1 + account_id + key for keys, etc.), and contract data is handled separately, stored under a separate hash of the contract. We need to figure out whether we want global contract storage (discussed somewhat here https://gov.near.org/t/store-protocol-config-in-a-contract/2105) or shard-specific contract storage. In general, if we are considering switching to any other kind of storage, it is best to batch all of this together and do it in a single migration. Depending on how long it takes to remap data from the old storage to the new one: if it is under a few minutes on an average server, we may want to consider doing it in a single step after some block, vs. working out a mechanism to do it lazily or over an epoch.
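As a rough sketch of the layout described above (the prefix bytes and helper names here are assumptions for illustration, not the exact encoding nearcore uses), the per-account keys are built roughly like this:

```rust
// Hypothetical key construction mirroring the prefix scheme described above.
fn account_info_key(account_id: &str) -> Vec<u8> {
    // 0x00 ++ account_id -> serialized account info (balance, nonce, ...)
    let mut key = vec![0x00];
    key.extend_from_slice(account_id.as_bytes());
    key
}

fn per_account_key(account_id: &str, subkey: &[u8]) -> Vec<u8> {
    // 0x01 ++ account_id ++ subkey -> per-account records (e.g. keys)
    let mut key = vec![0x01];
    key.extend_from_slice(account_id.as_bytes());
    key.extend_from_slice(subkey);
    key
}

// Contract code is addressed separately, by the hash of the contract itself,
// which is what raises the global vs. shard-specific contract storage question.
fn contract_code_key(code_hash: &[u8; 32]) -> Vec<u8> {
    code_hash.to_vec()
}
```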
Alternatively, we should consider not maintaining a persistent tree on disk at all: only final data is written to disk, and all non-final changes are kept in memory. Writing to disk then does not block chunk application, which means it can be done in parallel. This creates a potential attack vector: maintaining a large number of blocks without finality. One option to address this would be to make the block wait time a function of the distance to the last final block, which would allow potentially censored or slowed-down validators to propagate their approvals instead of having their blocks marked as missing.
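A minimal sketch of this scheme, with hypothetical types and names: only final data lives on disk, every non-final block keeps its changes as an in-memory delta, and a delta is flushed (possibly in parallel with chunk application) once its block becomes final.

```rust
use std::collections::HashMap;

type Key = Vec<u8>;
type Value = Vec<u8>;
type BlockHash = [u8; 32];

#[derive(Default)]
struct Storage {
    final_on_disk: HashMap<Key, Value>,                // stands in for the database
    pending: HashMap<BlockHash, HashMap<Key, Value>>,  // per-block in-memory deltas
}

impl Storage {
    // Reads walk the in-memory deltas of the block's ancestry first,
    // then fall back to the final data on disk.
    fn get(&self, ancestry: &[BlockHash], key: &Key) -> Option<&Value> {
        for block in ancestry.iter().rev() {
            if let Some(delta) = self.pending.get(block) {
                if let Some(value) = delta.get(key) {
                    return Some(value);
                }
            }
        }
        self.final_on_disk.get(key)
    }

    // When a block becomes final its delta is written out as one batch; this
    // write does not block applying newer chunks, which read from memory.
    fn finalize(&mut self, block: &BlockHash) {
        if let Some(delta) = self.pending.remove(block) {
            self.final_on_disk.extend(delta);
        }
    }
}
```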
This is an interesting idea, but I think it is orthogonal to the issue we have here. Even if we only write final data, we can still end up having clashes, since two shard tries can have some trie nodes that are the same.
It can be orthogonal, but if we don't need a persistent tree on disk, there are a bunch of other things we can do to simplify the storage, including removing ref counting, which would address the initial issues that led to adding the shard id to the key. For example, with this we can switch to a binary Merkle tree. Given a single binary Merkle tree in storage, the layout on disk can be a hashmap indexed by data key, with the values stored in the leaf nodes, so reads become direct lookups in the underlying storage. Writes to storage happen when a block becomes final: a whole batch of updates is saved, where a set of {id -> Node} and {key -> Leaf} entries are updated and some new ones added. Note that most of the intermediate nodes don't need to be stored on disk and can be recomputed from the in-memory data, since at this point it's known that storage access is a far bigger bottleneck than computing a bunch of hashes. Also, splitting, merging, and removing sub-trees becomes way simpler with binary Merkle trees.
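A sketch of the on-disk layout this suggests, with hypothetical names and a placeholder hash combiner standing in for a real cryptographic hash:

```rust
use std::collections::HashMap;

type NodeId = u64;      // globally unique id assigned when a node is created
type Hash = [u8; 32];

struct Node { left: NodeId, right: NodeId, hash: Hash }
struct Leaf { value: Vec<u8>, hash: Hash }

// The batch written to disk when a block becomes final:
// a set of {id -> Node} plus {data key -> Leaf} updates.
struct FinalBatch {
    nodes: HashMap<NodeId, Node>,    // new/updated interior nodes worth persisting
    leaves: HashMap<Vec<u8>, Leaf>,  // plain hashmap indexed by data key
}

// Most intermediate hashes can be recomputed from children instead of being
// read from disk, trading extra hashing for fewer storage accesses.
fn parent_hash(left: &Hash, right: &Hash) -> Hash {
    // Placeholder combiner; a real implementation would hash left ++ right
    // with a cryptographic hash such as sha256.
    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = left[i] ^ right[i];
    }
    out
}
```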
@ilblackdragon if we don't have a persistent trie on disk, then we likely need to limit how far back a challenge can go to a small number of blocks. Otherwise a malicious actor could stall the chain for a long period of time by challenging a chunk from some 20000 blocks earlier. Another issue is that we need to prune while we process new blocks, because if we wait for garbage collection it could blow up memory usage.
Those ids are globally unique, I assume?
Well, one would say that if you can challenge a block, it's not truly final. So I agree that limiting how far back one can challenge is important. This is also true in the current implementation - we already limit challenges to the current epoch; beyond that one can't really challenge, as validators may not have the state.
Prune what exactly? Do you mean prune the in memory trie of the changes?
Yes.
Yes. Otherwise we would keep all the old forks in memory until the garbage collector catches up, which could potentially be exploited.
Finalization will automatically prune everything in memory that was created before this final block, including forks and uncles.
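Under the in-memory-delta assumption above, pruning at finality could look roughly like this (illustrative types only): every delta at or below the final height, including abandoned forks and uncles, is dropped.

```rust
use std::collections::HashMap;

type BlockHash = [u8; 32];
type StateDelta = HashMap<Vec<u8>, Vec<u8>>;

struct PendingChanges {
    // block hash -> (block height, in-memory state delta for that block)
    deltas: HashMap<BlockHash, (u64, StateDelta)>,
}

impl PendingChanges {
    // Once a block at `final_height` is final, nothing at or below that height
    // can be built on or challenged, so those deltas are freed immediately
    // rather than waiting for a garbage collector.
    fn prune_on_finality(&mut self, final_height: u64) {
        self.deltas.retain(|_, (height, _)| *height > final_height);
    }
}
```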
@ilblackdragon that means we will have trouble accessing states that are more than a few blocks old, which is not only problematic for RPC nodes but also poses issues for state sync.
Discussed this today with the team, and we think that in general it is a good idea to replace refcounting with global ids. One implication is that the same node may be replicated multiple times in storage instead of having one node with a refcount. This may be problematic because today certain contracts are deployed on a lot of different accounts (multisig, for example) and they are relatively large (~300kb). Replicating them for each account could end up increasing storage usage significantly.
Yes, this brings up the global contract storage question: we should have a way to deploy a contract once so that it is stored across all shards and available via its hash. This would avoid the problem of replicated contracts being overcharged, and the resharding problem where you don't know which shard needs which contracts. IMO it makes sense to draft a storage revamp NEP that includes all of these questions and resolves them via a single transition.
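A sketch of what global contract storage could look like (hypothetical names, not an existing API): code is stored once, keyed by its hash and visible to every shard, and accounts reference the hash instead of carrying their own copy of the wasm.

```rust
use std::collections::HashMap;

type CodeHash = [u8; 32];

#[derive(Default)]
struct GlobalContractStore {
    // code hash -> wasm bytes; deployed once, addressable from every shard
    code: HashMap<CodeHash, Vec<u8>>,
}

struct Account {
    balance: u128,
    // Instead of duplicating a ~300kb multisig wasm per account, each account
    // stores only the 32-byte hash of the code it runs.
    code_hash: Option<CodeHash>,
}

impl GlobalContractStore {
    fn deploy(&mut self, hash: CodeHash, wasm: Vec<u8>) {
        self.code.entry(hash).or_insert(wasm);
    }

    fn code_for(&self, account: &Account) -> Option<&Vec<u8>> {
        account.code_hash.as_ref().and_then(|hash| self.code.get(hash))
    }
}
```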
Today when we save a trie node, the key we use is the concatenation of the shard id and the trie node hash. This is problematic when we want to split one shard into multiple shards, as it does not allow us to simply split at the trie level; instead we need to rebuild the trie with a different shard id in the database key. The reason why we decided to have the shard id as part of the database key in the first place is #2568, but, as @ilblackdragon suggested, this seems to be overkill.
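To make the difference concrete, a sketch of the two key schemes (hypothetical helper names; byte order and encoding are assumptions):

```rust
type ShardId = u64;
type Hash = [u8; 32];

// Current scheme: the same trie node, shared by two shards, lives under two
// different keys, and splitting a shard requires rewriting every node.
fn trie_node_key_current(shard_id: ShardId, node_hash: &Hash) -> Vec<u8> {
    let mut key = shard_id.to_le_bytes().to_vec();
    key.extend_from_slice(node_hash);
    key
}

// Shard-agnostic alternative: keys survive a shard split unchanged, but
// identical nodes from different tries collide, which is exactly what
// refcounting (or globally unique ids) has to resolve.
fn trie_node_key_by_hash(node_hash: &Hash) -> Vec<u8> {
    node_hash.to_vec()
}
```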