Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rollback support in the event of an incorrect hash #10281

Closed
4 tasks
Tracked by #10656 ...
cmwaters opened this issue Oct 1, 2021 · 27 comments · Fixed by #11179
Closed
4 tasks
Tracked by #10656 ...

Add rollback support in the event of an incorrect hash #10281

cmwaters opened this issue Oct 1, 2021 · 27 comments · Fixed by #11179

Comments

@cmwaters
Copy link
Contributor

cmwaters commented Oct 1, 2021

Problem Definition

The original issue can be found here. As a quick summary, in the event of a non-deterministic app hash or when an upgrade fails, Tendermint will have persisted the incorrect AppHash and nodes will be unable to make progress. What needs to happen is that the application should revert back to the previous state, Tendermint should also rollback to the previous state, then upon startup Tendermint can replay the last block and should now have the correct AppHash to continue.

Proposal

Work on the Tendermint side is underway here and will be backported to v0.34.14 when it is merged. It exposes a public function RollbackState which the SDK can use to provide the rollback tooling necessary.

cc @aaronc, @robert-zaremba, @ethanfrey


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@clevinson
Copy link
Contributor

@robert-zaremba can you coordinate this in context of the upgrade wg?

@robert-zaremba
Copy link
Collaborator

My understanding of this issue is to solve the following use case:

  1. There is a bad app update which causes none-determinism
  2. chain nodes starts and manage to produce only one block but don't agree on it because of the non-determinism
  3. the chain halts
  4. now we need to reach a social consensus and rollback that block (or more blocks if needed) and start again with a patched app.

Do I get it right, @cmwaters ? How important it is? Is it fine to roll it with v0.45?
What's the expected flow? How about this:

  • App will expose a command to rollback a state (eg: simd state rollback x where x = number of blocks)
  • App will call tendermint to do the same.

@cmwaters
Copy link
Contributor Author

cmwaters commented Oct 2, 2021

Do I get it right, @cmwaters ?

Yup, although a minor clarification, Tendermint has its own state that tracks validator sets, consensus params and the app hash. We aren't actually removing any blocks, only the Tendermint state at that height. When the node restarts it will replay the same transactions in the last block.

How important it is? Is it fine to roll it with v0.45?

Hard for me to judge how important it is because I haven't built an app before, but it was requested and it makes sense as a piece of tooling that is required. I think having it in v0.45 is fine. I don't see why it also couldn't be backported in the v0.44 range (we will be backporting it into v0.34 Tendermint).

How about this:

  • App will expose a command to rollback a state (eg: simd state rollback x where x = number of blocks)
  • App will call tendermint to do the same

Tendermint will have a public function (called RollbackState - not sure which package it will be in yet) which takes in the tendermint/config.Config as an argument. This means that the app should just be able to call a single command in the Cosmos SDK i.e. simd state rollback and it should rollback the app state to the previous height and then should call this Tendermint function to rollback Tendermint state.

I thought about adding a --height flag so users could rollback to a certain height but I don't believe it is that necessary. Only the last height is important. Every prior one, consensus had successfully agreed on the same AppHash.

@ethanfrey
Copy link
Contributor

We hit a non determinism issue somehow related to an ibcclientupdate tx. Leading to AppHash mismatch.

While trying to debug and actually get enough info to make a proper bug report, we were hampered by the lack of any tool to try one block earlier. And are syncing 510.000 blocks to try to see if we can reproduce it, not so fun.

I can also try to lend a hand for a fix on 0.42. As soon as the tendermint fix is in 0.34

@ethanfrey
Copy link
Contributor

NB this has been requested by many chains over the last few years. And I built some tool back in 2018/2109 that did this, but never was maintained. AppHash mismatch is a pain to debug currently

@robert-zaremba
Copy link
Collaborator

and then should call this Tendermint function to rollback Tendermint state.

@cmwaters , I think we need to expose RPC call as well: when a tendermint is run independently (as a separate process) we should still have a way to send that request.

@ethanfrey
Copy link
Contributor

ethanfrey commented Oct 4, 2021

This should not be a publicly available rpc endpoint. And does not need to be run remotely. We do need to handle the multi-process use case however.

Something like tendermint rollback and then gaiad rollback seems like it cover the multi-process scenario and the tendermint side is already implemented. Exposing this anywhere off of localhost seems like a huge security issue

@cmwaters
Copy link
Contributor Author

cmwaters commented Oct 4, 2021

Yup completely agree that we don't need to expose this via RPC. The process shouldn't be running when rollback is called (in fact the database will error when trying to open a second connection).

It's feasible to have a single command even for multi-process instances because no processes should be running. All you need is the tendermint Config so the command knows where to find the database and it will perform the rollback.

@cmwaters
Copy link
Contributor Author

cmwaters commented Oct 8, 2021

Just an update on the Tendermint side. We've merged changes to master and backported them to the respective branches. We'll most likely release it in v0.34.14 next week and v0.35.0 the following week (no promises :) ). The function you will want to call is this here:
https://github.com/tendermint/tendermint/blob/f2a8f5e054cf99ebe246818bb6d71f41f9a30faa/cmd/tendermint/commands/rollback.go#L37

@ethanfrey
Copy link
Contributor

Thank you. I am watching this issue. Please do let me know when this is available in 0.34.14 and I will try to make time to start a branch on the cosmos-sdk side (exploratory work first)

@zmanian
Copy link
Member

zmanian commented Nov 14, 2021

So I investigated using the Tendermint rollback command in combination with modified application state to help the Thorchain team recover a halted network.

One thing the Tendermint rollback command doesn't do is delete the block that was rolled back, this seems to cause the ABCI replay handshake to fail because it tries to apply the same block again.

@cmwaters
Copy link
Contributor Author

The idea was intentionally to not delete the block but just Tendermint's own internal state (which tracks the AppHash). That way upon start up Tendermint would replay that last block and get the "new" app hash from the application and persist that.

Is there a need to remove the block as well? I would think the application would just skip over any "corrupted" transactions.

@ethanfrey
Copy link
Contributor

If there is code that uses this somewhere (even a one-off script), that would be great to link here to make a concrete use case. These concrete use cases usually appear in critical (chain halt) scenarios, so best to record them to prepare tooling best beforehand and leave minimal work in crisis

yihuang added a commit to yihuang/cosmos-sdk that referenced this issue Mar 3, 2022
Closes: cosmos#10281

fix tendermint rollback

changelog

update tendermint to recent v0.35.x branch
yihuang added a commit to yihuang/cosmos-sdk that referenced this issue Mar 3, 2022
Closes: cosmos#10281

fix tendermint rollback

changelog

update tendermint to recent v0.35.x branch
tac0turtle pushed a commit that referenced this issue Mar 3, 2022
Closes: #10281

fix tendermint rollback

changelog

update tendermint to recent v0.35.x branch
mergify bot pushed a commit that referenced this issue Mar 3, 2022
Closes: #10281

fix tendermint rollback

changelog

update tendermint to recent v0.35.x branch

(cherry picked from commit 8296ad9)

# Conflicts:
#	CHANGELOG.md
#	go.mod
#	go.sum
yihuang added a commit that referenced this issue Mar 3, 2022
Closes: #10281

fix tendermint rollback

changelog

update tendermint to recent v0.35.x branch

(cherry picked from commit 8296ad9)

# Conflicts:
#	CHANGELOG.md
#	go.mod
#	go.sum
tac0turtle pushed a commit that referenced this issue Mar 7, 2022
Closes: #10281

fix tendermint rollback

changelog

update tendermint to recent v0.35.x branch

(cherry picked from commit 8296ad9)

# Conflicts:
#	CHANGELOG.md
#	go.mod
#	go.sum

Co-authored-by: yihuang <[email protected]>
yihuang added a commit to yihuang/cosmos-sdk that referenced this issue Mar 10, 2022
Closes: cosmos#10281

fix tendermint rollback

changelog

update tendermint to recent v0.35.x branch

(cherry picked from commit 8296ad9)

Co-authored-by: yihuang <[email protected]>
zakir-code pushed a commit to PundiAI/cosmos-sdk that referenced this issue Apr 8, 2022
Closes: cosmos#10281

fix tendermint rollback

changelog

update tendermint to recent v0.35.x branch

(cherry picked from commit 8296ad9)

# Conflicts:
#	CHANGELOG.md
#	go.mod
#	go.sum

Co-authored-by: yihuang <[email protected]>
@tac0turtle
Copy link
Member

Reopening as it seems this isn't fixed yet

@tac0turtle tac0turtle reopened this May 9, 2022
@ethanfrey
Copy link
Contributor

Reopening as it seems this isn't fixed yet

Oh, no. This is super important for recovery of chains from some non-determinism issues. I think high priority to finally get fixed.

@cmwaters
Copy link
Contributor Author

There was a small problem with blocksync which persisted the block before validating it and thus could leave it in a state whereby calling rollback wouldn't be able to fix it. We've solved it now (tendermint/tendermint#8223).

As a more general comment, there are really two forms of "rolling back" in Tendermint that I think it's important to distinguish:

  1. A single node rolling back the execution of the last block in the case of a non deterministic hash. This only requires the rolling back of the state captured after executing the block so the same block can be re-executed again (and with the correct binary should lead to a deterministic hash).
  2. The entire network rolling back the block itself because it contains transactions which corrupt application state and/or halt the chain.

I have only implemented for the first case and am currently against the second case because creating a new block at the same height means all validators are effectively double signing. I'd prefer, if there are "bad txs" somewhere, that the application should be then modified so it can correctly handle them (and continue onwards) rather than Tendermint trying to wipe history.

@ethanfrey
Copy link
Contributor

I think solving (1) well is the main use case too. It should be "easy" to recover from some non-deterministic app.

(2) will likely require some dump state and export type thing and I think it is fine to make this a manual issue for now

@yihuang
Copy link
Collaborator

yihuang commented May 11, 2022

it'll be fixed by #11361

JimLarson pushed a commit to agoric-labs/cosmos-sdk that referenced this issue Jul 7, 2022
Closes: cosmos#10281

fix tendermint rollback

changelog

update tendermint to recent v0.35.x branch

(cherry picked from commit 8296ad9)

# Conflicts:
#	CHANGELOG.md
#	go.mod
#	go.sum

Co-authored-by: yihuang <[email protected]>
randy75828 pushed a commit to Switcheo/cosmos-sdk that referenced this issue Aug 10, 2022
Closes: cosmos#10281

fix tendermint rollback

changelog

update tendermint to recent v0.35.x branch

(cherry picked from commit 8296ad9)

# Conflicts:
#	CHANGELOG.md
#	go.mod
#	go.sum

Co-authored-by: yihuang <[email protected]>
randy75828 pushed a commit to Switcheo/cosmos-sdk that referenced this issue Aug 10, 2022
Closes: cosmos#10281

fix tendermint rollback

changelog

update tendermint to recent v0.35.x branch

(cherry picked from commit 8296ad9)

# Conflicts:
#	CHANGELOG.md
#	go.mod
#	go.sum

Co-authored-by: yihuang <[email protected]>
tomtau pushed a commit to crypto-org-chain/cosmos-sdk that referenced this issue Sep 2, 2022
Closes: cosmos#10281

fix tendermint rollback

changelog

update tendermint to recent v0.35.x branch

(cherry picked from commit 8296ad9)

Co-authored-by: yihuang <[email protected]>
@robert-zaremba
Copy link
Collaborator

done in #11361

JeancarloBarrios pushed a commit to agoric-labs/cosmos-sdk that referenced this issue Sep 28, 2024
Closes: cosmos#10281

fix tendermint rollback

changelog

update tendermint to recent v0.35.x branch

(cherry picked from commit 8296ad9)

# Conflicts:
#	CHANGELOG.md
#	go.mod
#	go.sum

Co-authored-by: yihuang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants