uneven store heights prevent pruning and state sync #13933
Comments
Related |
Related |
Comdex v5.0.0 also has broken pruning |
Konstellation snapshot/pruning broke because feegrant and authz (in rootmulti) are at the wrong height (tree.ndb is nil) |
Comdex snapshot/pruning broke because the lendV1 store (in rootmulti) is at the wrong height (tree.ndb is nil) |
.... stride has this issue, as well, in v3.0.1 |
Stride just hit the same problem with their authz store being uninitialized and caused issues for validators. |
Probably because of the uneven IAVL state height issue that @chillyvee is mentioning here, state sync broke on stride:
git clone https://github.com/stride-labs/stride
cd stride
git checkout v3.0.0
bash scripts/statesync.bash
I don't remember having issues with uneven state heights in the past when adding modules. |
@faddat Juno was having a block header apphash issue as well, back on Nov 7th. Golden, Windpower, TedCrypto, Coldchain: fixed when they bumped Go from 1.18.3 to 1.19.2. |
@Reecepbcups really good. Let's break that out separately, though. I'll maybe do issue/pr for it, doing some 'splain... I think I know why that occurs.... The mixed go runtime issue is now here: @chillyvee 's PR related to this issue is here: |
Another validator also reports that statesync for konstellation still doesn't work (maybe a store with information at the wrong height?) |
@chillyvee -- my feeling is that you've correctly identified the problem. Not certain just yet. Here's my stride working branch: https://github.com/Stride-Labs/stride/tree/no-state-breaks That said, I figure that the fix is somewhere between here, iavl, and ibc-go, or such. |
This is a great thing to be doing. Happy to run some tests and provide feedback when we can. Looping in @qf3l3k to the chat. |
Likely root cause
It's likely along the lines of adding a KV Store Key without adding the keeper in the migration: https://github.com/Stride-Labs/stride/blob/v3.0.0/app/app.go#L328
That will probably continue on to confuse all the other stores/exports/snapshots/etc.
To operate correctly a chain MUST do all the following at the same time:
To resolve current breakages:
To completely ruin your node:
Current breakages
For chains like Konstellation/Stride where authz/feegrant/etc keepers are not initialized, statesync restore apphashes will not match. They should probably upgrade and officially issue a store upgrade.
Surprisingly not broken
Dig statesyncs with patches
Decline to fix
Probably possible to emulate a broken KV store during snapshot, but that seems like a waste of time if a proper upgrade can be issued instead. (But we tried to fix it anyway, see next message.)
A better fix
Check that each key actually has a keeper before entering consensus. This will allow nodes to start, upgrade/migrate, then check kv-store matches before starting up. The failure mode in this case is that a chain node could halt, giving time for teams to release a new version with a migration to formally add the missing store.
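The "better fix" above can be sketched as a startup check. This is a minimal, self-contained illustration and not the SDK's code: `StoreVersions`, `MismatchedStores`, and `CheckBeforeConsensus` are hypothetical names, and the sketch assumes the node can read each store's latest version alongside the multistore commit height.

```go
package main

import (
	"fmt"
	"sort"
)

// StoreVersions maps a store name to the latest version committed in that
// store's own tree. Hypothetical type, for illustration only.
type StoreVersions map[string]int64

// MismatchedStores returns, in sorted order, the names of stores whose latest
// version differs from the multistore commit height.
func MismatchedStores(commitHeight int64, stores StoreVersions) []string {
	var bad []string
	for name, version := range stores {
		if version != commitHeight {
			bad = append(bad, name)
		}
	}
	sort.Strings(bad)
	return bad
}

// CheckBeforeConsensus returns an error naming every uneven store, so a node
// can halt at startup instead of joining consensus with uneven heights.
func CheckBeforeConsensus(commitHeight int64, stores StoreVersions) error {
	if bad := MismatchedStores(commitHeight, stores); len(bad) > 0 {
		return fmt.Errorf("stores at wrong height (expected %d): %v", commitHeight, bad)
	}
	return nil
}

func main() {
	// Assumed example state: authz and feegrant were registered without a
	// store upgrade, so their trees never caught up to the chain height.
	stores := StoreVersions{"bank": 100, "staking": 100, "authz": 0, "feegrant": 0}
	if err := CheckBeforeConsensus(100, stores); err != nil {
		fmt.Println("refusing to start:", err)
	}
}
```

Halting with a list of the offending stores matches the failure mode described above: the node stops early, and the team gets a clear signal about which store still needs a formal migration.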
|
I like your proposal CV. |
This is a potential fix for snapshotting until chains execute the final keeper + store migrations.
Root cause
Snapshots are missing IAVL nodes when heights do not match.
Fix
Snapshots can now be created even with mismatched rootmulti/store heights.
Snapshot Patch
After checkout of chain code
Tested it against Konstellation v0.6.0 and it works.
Isolate from peers without snapshot patched
You will need to isolate your peers to avoid getting broken snapshots from others. This means setting the following on the restoring node (not the source node)
Tested
Konstellation v0.6.0 snapshot and restore now operates properly.
Need assistance
Since stride 3.0.0/comdex 5.0.0 are also on cosmos-sdk v0.45.9, I think the patch will work there as well. For now the code seems functional, but isn't clean enough to upstream. Would appreciate some help testing before I submit the PRs. |
For Konstellation
For Comdex
For Stride
|
Super interesting breakdown!! Thank you for this. It seems there is a massive UX issue here: chains end up broken because these requirements are not well documented. I like your approach, and I believe this is the right move. We can add some of the footgun preventers in previous releases as well. Lmk how I can help you |
Need another day or two to clean up some code and send in another few helpful PRs. Once you see those, we might find a pattern we can enhance :) Thank you tac0turtle! |
Lum network broke on v1.3.0 due to 'feeibc' store height mismatch. |
Hello everyone, I'm having a similar issue at the starname chain (cosmos-SDK v0.45.9), but in my case, I got an empty app hash.
And now we're using the cosmos v0.45.9 |
@chillyvee for whatever reason I cannot find the comment and/or PR, but I swear I commented somewhere that the root fix to many of these issues wrt pruning and upgrading with new stores is that when a store is loaded for the first time, i.e. it does not currently exist, we set its initial version to the current version and not 0. Does this ring a bell? |
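A toy model of the idea in the comment above, assuming a store whose first commit can be pinned to a chosen initial version. `toyStore` and `simulate` are hypothetical and only track version numbers; real stores are IAVL trees, and this is not the SDK's implementation.

```go
package main

import "fmt"

// toyStore is a hypothetical stand-in for a versioned KV store.
type toyStore struct {
	version        int64 // latest committed version; 0 means nothing committed yet
	initialVersion int64 // if > 0, the version assigned to the first commit
}

// Commit saves a new version and returns it. The first commit of an empty
// store uses initialVersion when one was set; otherwise versions count up by 1.
func (s *toyStore) Commit() int64 {
	if s.version == 0 && s.initialVersion > 0 {
		s.version = s.initialVersion
	} else {
		s.version++
	}
	return s.version
}

// simulate adds a new store to a chain whose existing store is at version 100,
// commits three blocks, and returns the final version of both stores.
func simulate(setInitialVersion bool) (existing, added int64) {
	old := &toyStore{version: 100}
	fresh := &toyStore{}
	if setInitialVersion {
		// Start the new store at the version of the next block to be committed.
		fresh.initialVersion = old.version + 1
	}
	for i := 0; i < 3; i++ {
		existing = old.Commit()
		added = fresh.Commit()
	}
	return existing, added
}

func main() {
	e, a := simulate(false)
	fmt.Printf("without initial version: existing=%d added=%d\n", e, a)
	e, a = simulate(true)
	fmt.Printf("with initial version:    existing=%d added=%d\n", e, a)
}
```

Without the initial version, the added store stays permanently behind the others, which is the uneven-height condition this thread describes. With it, every store reports the same version after each commit.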
@dsmello - Haven't seen an empty app hash before. Can you share some logs? Was there a recent upgrade and if so what was the tag of the version before, and the version of the upgrade? Is this the repo? https://github.com/iov-one/starnamed |
Prevention:
Fixing broken appstate:
|
Hi @chillyvee, The current version of starnamed is v0.11.6 (wasmd -> ; cosmos-SDK -> v0.45.9 ; tendermint -> v0.34.21). We migrated from v0.10.13 -> v0.11.6 (Cosmos-SDK v0.42.5 -> v0.45.9), adding these modules (authzKeeper, feegrant, escrowtypes, and the burner module); the only one that doesn't have a store is the burner module. This is the error:
I tried to do some debugging and:
RESPONSE: %+vdata:"starnamed" version:"v0.11" last_block_height:12634200 module=statesync
SNAPSHOT: %+v&{12645200 1 1 [185 116 185 254 4 253 237 156 245 219 141 194 180 138 121 209 0 236 44 176 80 47 252 87 27 253 64 19 134 157 60 243] [10 32 185 116 185 254 4 253 237 156 245 219 141 194 180 138 121 209 0 236 44 176 80 47 252 87 27 253 64 19 134 157 60 243] [133 61 91 117 125 65 241 186 124 193 78 140 57 13 100 55 127 108 129 28 215 105 235 173 44 163 155 0 93 75 107 106]}
And what logs did you need?
[p2p]
persistent_peers = "[email protected]:26656"
[statesync]
enable = true
rpc_servers = "http://35.210.33.5:26657,http://35.210.33.5:26657"
trust_height = 12562301
trust_hash = "6E4FBCF45010667D230382A150461E0DD4467A7E2BEA8840607135D766657E2C"
trust_period = "2000h0m0s" |
@chillyvee Ok so we have #14410 and the author doesn't seem to be too responsive. @catShaark if you're OK with it, I will open another PR. I would like to see a single PR address these points. |
@alexanderbez - Correct the idea of #14410 (or any replacement) will prevent the root cause of the issue we commonly see. |
Overall an interesting situation. No additional logs needed but we need to pick a direction. Didn't see your ID on the committers list for starnamed, so I assume you are not a developer for that repo and a validator only? For a full fix, it would be good to bring in one of the core developers. How about taking a look at a copy of the data from a working node. Is anybody sharing pruned zip files of the data directory? |
I never added myself to the contributors list and need to fix the git. I'll do that this week. The current version is v0.11.6. Yes, you can use the following: I still need to gain experience debugging Cosmos. What am I supposed to look at in the data folder? |
@dsmello - The issue starname is facing is NOT the one in this thread. It looks like your stores are not in the array (rs.stores is EMPTY) for snapshotting at this point in the code: It seems somehow starname is able to define all the stores and operate without setting all the related fields. That means that any restore with statesync results in no data and no apphash. |
Hi @chillyvee, Thanks for the help. |
Part of the issue is solved: Notional submitted a fix to prevent chains from starting when adding a store at a different height. How can we help teams recover from the state they are in? Should we write a tool? I think people need to upgrade their chains to recover, right? |
I think that there is a related issue:
"message": "\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).createQueryContext\n\tgithub.com/cosmos/[email protected]/baseapp/abci.go:670\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).handleQueryGRPC\n\tgithub.com/cosmos/[email protected]/baseapp/abci.go:592\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).Query\n\tgithub.com/cosmos/[email protected]/baseapp/abci.go:441\ngithub.com/tendermint/tendermint/abci/client.(*localClient).QuerySync\n\tgithub.com/tendermint/[email protected]/abci/client/local_client.go:256\ngithub.com/tendermint/tendermint/proxy.(*appConnQuery).QuerySync\n\tgithub.com/tendermint/[email protected]/proxy/app_conn.go:159\ngithub.com/tendermint/tendermint/rpc/core.ABCIQuery\n\tgithub.com/tendermint/[email protected]/rpc/core/abci.go:20\ngithub.com/tendermint/tendermint/rpc/client/local.(*Local).ABCIQueryWithOptions\n\tgithub.com/tendermint/[email protected]/rpc/client/local/local.go:87\ngithub.com/cosmos/cosmos-sdk/client.Context.queryABCI\n\tgithub.com/cosmos/[email protected]/client/query.go:94\ngithub.com/cosmos/cosmos-sdk/client.Context.QueryABCI\n\tgithub.com/cosmos/[email protected]/client/query.go:57\ngithub.com/cosmos/cosmos-sdk/client.Context.Invoke\n\tgithub.com/cosmos/[email protected]/client/grpc_query.go:81\ngithub.com/cosmos/cosmos-sdk/x/bank/types.(*queryClient).AllBalances\n\tgithub.com/cosmos/[email protected]/x/bank/types/query.pb.go:917\ngithub.com/cosmos/cosmos-sdk/x/bank/types.request_Query_AllBalances_0\n\tgithub.com/cosmos/[email protected]/x/bank/types/query.pb.gw.go:139\ngithub.com/cosmos/cosmos-sdk/x/bank/types.RegisterQueryHandlerClient.func2\n\tgithub.com/cosmos/[email protected]/x/bank/types/query.pb.gw.go:684\ngithub.com/grpc-ecosystem/grpc-gateway/runtime.(*ServeMux).ServeHTTP\n\tgithub.com/grpc-ecosystem/[email protected]/runtime/mux.go:240\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\tgithub.com/gorilla/[email protected]/mux.go:210\ngithub.com/tendermint/tendermint/rpc/jsonrpc/server.maxBytesHandler.ServeHTTP\n\tgithub.com/tendermint/[email protected]/rpc/jsonrpc/server/http_server.go:256\ngithub.com/tendermint/tendermint/rpc/jsonrpc/server.RecoverAndLogHandler.func1\n\tgithub.com/tendermint/[email protected]/rpc/jsonrpc/server/http_server.go:229\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2109\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2947\nnet/http.(*conn).serve\n\tnet/http/server.go:1991\nfailed to load state at height 69; version mismatch on immutable IAVL tree; version does not exist. Version has either been pruned, or is for a future block height (latest height: 69): invalid request",
Related issue: |
I think they would need a separate upgrade and corresponding handler to fix the broken state. Either, perhaps, by renaming it or by deleting it and creating it again? |
This just hit kujira testnet v0.8.0 in the alliance store. |
da, I don't think we can backport the fix, because some chains already have this issue and they will get hit with errors. We should write documentation around this |
Mhhh wait, why is this issue still open? Doesn't #14410 fix it? Also, why can't we backport? |
I believe a chain with uneven store heights wouldn't be able to start. Is that right? |
Ahh yes, the fix is NOT retroactive indeed. So existing chains that already have this issue will need to manually patch it in an upgrade. |
Chain runs, but pruning, snapshots and statesync are all broken without the patches we issued. |
ah okay, then we can backport, to help prevent future issues |
Hi, this issue is to track a set of issues, and may become an epic.
Pruning and/or state sync
guess
This is because of issues in the CMD folder, and possible validator misconfiguration.
I think that this can be fixed by assisting crypto.org with migrating, then compacting their archival node state, so that they can use fast node and pebble, which makes archives much more performant.
Some of these issues were introduced here:
https://github.com/cosmos/iavl/commits/master#:~:text=feat%3A%20Add%20skipFastStorageUpgrade%20to%20MutableTree%20to%20control%20fast%20s%E2%80%A6