allow pre-upgrade-block cosmos upgrade handler to drain swingset queue #6263

Open
warner opened this issue Sep 19, 2022 · 10 comments
Labels: cosmic-swingset (package: cosmic-swingset), enhancement (New feature or request), vaults_triage (DO NOT USE)

@warner (Member) commented Sep 19, 2022

What is the Problem Being Solved?

We expect to have at least one chain upgrade that discards all the swingset state and resumes from a brand new kernel/bootstrap, with no durable state outside of the cosmos IAVL tree. (Hopefully this highly traumatic upgrade will get us to a code base that can be upgraded in smaller pieces later, so we never need to do more than one such upgrade.) To limit the damage, the PSM-only ("pismo") release generally avoids deploying contracts or components (e.g. Zoe) whose critical state cannot be recovered from the cosmos IAVL tree.

When the governance proposal to perform this upgrade happens, the swingset kernel run-queue (or the new "inbound queue") might not be empty. That could cause problems with e.g. partially-committed vbank writes. In the worst case, that could violate an invariant with token balances for trades that were in-flight at the time the swingset state was abandoned.

We think it would be better if the last pre-upgrade block could drain the kernel run-queue completely.

Description of the Design

Governance proposals can set a flag to be examined at certain block heights. If the running binary does not have this flag, cosmos will halt at that point, to allow a replacement version to take over. That new version will have the flag, so the same check in the new version will not halt.

@arirubinstein said that there's already some support for doing things in the next-to-last block: some Go code can execute to prepare for the halt. I don't know the details.

We'd want to add a new ActionType (messages sent from cosmos to the cosmic-swingset JS side). We already have an END_BLOCK which causes the kernel to be cycled until the runPolicy reaches the computron/bean limit. The new action would cycle the kernel without a runPolicy constraint (just like it does for the bootstrap block, which can take as long as it needs to completely finish bootstrap). The Go code that runs pre-halt would want to use this action and let it complete fully, before halting. That way we can be sure there's nothing left in the kernel run-queue, increasing the coherence of the saved state.
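
A rough sketch of the Go side of that idea, where the action type name ("SWINGSET_DRAIN"), the struct shape, and the send helper are all hypothetical stand-ins rather than existing cosmic-swingset APIs:

	package swingsetdrain

	import (
		"fmt"
		"time"
	)

	// blockAction mirrors the general shape of block actions delivered to the
	// JS side; the field names here are illustrative only.
	type blockAction struct {
		Type        string `json:"type"`
		BlockHeight int64  `json:"blockHeight"`
		BlockTime   int64  `json:"blockTime"`
	}

	// sendToController stands in for whatever bridge already delivers
	// BEGIN_BLOCK/END_BLOCK actions to the JS side.
	type sendToController func(action blockAction) (string, error)

	// drainBeforeHalt sends the proposed unconstrained-run action and blocks
	// until the JS side reports that the kernel run-queue is empty.
	func drainBeforeHalt(send sendToController, height int64, blockTime time.Time) error {
		action := blockAction{
			Type:        "SWINGSET_DRAIN", // hypothetical new ActionType
			BlockHeight: height,
			BlockTime:   blockTime.Unix(),
		}
		if _, err := send(action); err != nil {
			return fmt.Errorf("drain before halt failed: %w", err)
		}
		return nil
	}

The real version would reuse whatever bridge already delivers BEGIN_BLOCK/END_BLOCK actions, and would block until the JS side confirms the run-queue is empty, before the node halts.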

This would probably also drain the "inbound queue", so e.g. all PSM trades that were waiting to get into the kernel would be executed. If that queue is very full, this might extend the final pre-halt block by quite a long time. We might want to consider adding code that deletes all these pending PSM trades (smart-wallet instructions) from the inbound queue just before we attempt to drain the kernel run-queue, cancelling those requests rather than extending the execution too long. The inbound-queue may contain other messages from Go (IBC events?), and we should decide whether we want to selectively delete only wallet instructions, or just delete the entire inbound queue.

Upgrade/Release Considerations

We don't think this change needs to go into the first PSM release (i.e. it is not a blocker for psmo-a). By spending an extra governance proposal / upgrade cycle, we could develop the feature later, and perform an upgrade from psmo-a to psmo-b which leaves the kernel+contracts intact and only adds this ability to the Go and cosmic-swingset components. Then a later upgrade from psmo-b to something more complicated would drain the swingset queue as it exits psmo-b, making it safer for the upgrade to clobber the swingset state upon entry to whatever comes next.

Security Considerations

If the kernel run-queue or the inbound-queue is very full, this drain operation could take a long time, proportional to the amount of work in that queue. That might be an availability problem.

Test Plan

Not sure; we don't have great test coverage in cosmic-swingset.

warner added the enhancement (New feature or request) and cosmic-swingset (package: cosmic-swingset) labels on Sep 19, 2022
@warner (Member, Author) commented Sep 20, 2022

We should probably send a BOYD (bringOutYourDead) delivery to every vat at the same time, to flush the virtual-object caches (otherwise a random amount of virtual-object state changes will still be sitting in the liveslots cache, and won't make it out to kernel-side DB state). Our vat-upgrade process (vatAdminService.upgrade()) already does this, but the kind of upgrade we're talking about here won't use that.

Draining the swingset queue can be done from the cosmic-swingset JS code, without making changes to the kernel. But swingset doesn't currently provide an external API for sending BOYD to vats, so we'll need to make kernel changes to enable this. Maybe a controller.flush()?

@mhofman (Member) commented Nov 9, 2022

@michaelfig rightly pointed out this may be doomed if we have any kind of infinite message loop in our system.

@mhofman (Member) commented Jan 24, 2023

I am not sure this should be included in the vaults release. We don't currently have a mechanism to flow upgrade details into the JS side, which would be required for this.

ivanlei added this to the Vaults RC0 milestone on Feb 1, 2023
@mhofman (Member) commented Feb 1, 2023

I believe the conclusion from the last kernel meeting was that this isn't needed.

@JimLarson (Contributor) commented:

Deferred per 2023-01-25 kernel meeting discussion.

rowgraus removed this from the Vaults Functional Testing milestone on Feb 13, 2023
@mhofman (Member) commented Jun 29, 2023

> @arirubinstein said that there's already some support for doing things in the next-to-last block: some Go code can execute to prepare for the halt. I don't know the details.

I don't believe there is such a capability in the cosmos-sdk.

My understanding of this issue is that for planned upgrades, we want to wind down the Swingset run queue in the software running prior to the upgrade, so that when the upgrade height is reached, the state won't impede any potential work we may want to perform during the upgrade.

cosmos-sdk's upgrade module only triggers at BeginBlock, when the upgrade height or time is reached. If the running chain software cannot handle the named upgrade, it panics (the prior version's behavior). If it can, and it discovers that some modules now have a different version, it runs their migrations (the upgraded version's behavior).

What we could potentially do is "simulate" that check: query the upgrade module's keeper to see whether an upgrade plan exists, and whether that plan is scheduled for block + 1 / time + x. However, this means mostly duplicating the logic of the upgrade module's BeginBlocker, which feels brittle.
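
A minimal sketch of that "simulate" check, assuming the swingset module's BeginBlock is given access to the upgrade keeper; GetUpgradePlan is the real cosmos-sdk API (shown again by @michaelfig below), while the one-block lookahead and the height-only handling are the simplifications that make this brittle:

	package swingset

	import (
		sdk "github.com/cosmos/cosmos-sdk/types"
		upgradekeeper "github.com/cosmos/cosmos-sdk/x/upgrade/keeper"
	)

	// shouldDrainThisBlock reports whether an upgrade plan is scheduled for
	// the next block, i.e. whether the current block is the last one this
	// binary will execute and should therefore drain the swingset queues.
	func shouldDrainThisBlock(ctx sdk.Context, uk upgradekeeper.Keeper) (bool, string) {
		plan, hasPlan := uk.GetUpgradePlan(ctx)
		if !hasPlan {
			return false, ""
		}
		// Height-based plans only; time-based plans would need the analogous
		// block-time comparison, which is part of what makes duplicating the
		// upgrade module's BeginBlocker logic brittle.
		if plan.Height == ctx.BlockHeight()+1 {
			return true, plan.Name
		}
		return false, ""
	}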

An alternative would be to simply drain the queue with the new software, before executing any swingset-upgrade-specific operations. This carries different risks.

@warner (Member, Author) commented Aug 1, 2023

This came up again in today's leads meeting. Draining would (somewhat) reduce the risk of vat upgrades that are bundled into chain-software upgrades, by preventing pre-existing run-queue actions from executing in the middle of a contract/zoe/wallet upgrade. The risk is objectively pretty low, since statistically most of our blocks are empty, but still.

@michaelfig / @JimLarson can you shed any light on the "does cosmos-sdk provide support for knowing when the next upgrade height will be" question?

@mhofman (Member) commented Aug 2, 2023

Chain software upgrades are handled by the upgrade module, and the upgrade mechanism is described here: https://github.com/agoric-labs/cosmos-sdk/blob/v0.45.11-alpha.agoric.3/docs/core/upgrade.md

I suppose the goal is to drain the swingset run queue and/or prevent new inbound actions from being processed sometime before an upgrade?

I believe there is currently no upgrade-supported way to trigger anything ahead of the upgrade height. However, each module could implement a scheduled param change on its own through the BeginBlock callback (that's actually how the upgrade module works). I don't know if there is any built-in or third-party module that facilitates scheduling module parameter changes at a specific height. Anyway, we likely could add support for this ourselves, and include such a scheduled param change in the governance proposal for the upgrade.
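
A hedged sketch of what such a self-scheduled param change might look like in the swingset module's BeginBlock; the scheduledParamChange record and wherever it would be stored are assumptions, since nothing like this exists today:

	package swingset

	import (
		sdk "github.com/cosmos/cosmos-sdk/types"
	)

	// scheduledParamChange is a hypothetical record written by a governance
	// proposal: apply the given params once the chain reaches Height.
	type scheduledParamChange struct {
		Height int64
		Apply  func(ctx sdk.Context) // closure over the new param values
	}

	// applyScheduledParamChanges would run in the module's BeginBlock,
	// mirroring how the upgrade module itself checks its plan height on
	// every block. It applies any due changes and returns the rest.
	func applyScheduledParamChanges(ctx sdk.Context, pending []scheduledParamChange) []scheduledParamChange {
		var remaining []scheduledParamChange
		for _, change := range pending {
			if ctx.BlockHeight() >= change.Height {
				change.Apply(ctx)
			} else {
				remaining = append(remaining, change)
			}
		}
		return remaining
	}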

Also, we likely could arrange the upgrade handling in cosmic-swingset to attempt draining the kernel run-queue before running the actions that are part of the upgrade plan. However, there is always the risk that the current run is in a ping-pong loop, at least until we implement #7938 (or a more advanced scheduler). I discussed this with @michaelfig the other day, and we decided it was OK to let things interleave for now.
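
For that alternative, the hook point would presumably be the upgrade handler registered for the plan's name; in this sketch SetUpgradeHandler is the real cosmos-sdk API, while drainKernel, runMigrations, and the drain-before-migrations ordering are hypothetical stand-ins:

	package app

	import (
		sdk "github.com/cosmos/cosmos-sdk/types"
		"github.com/cosmos/cosmos-sdk/types/module"
		upgradekeeper "github.com/cosmos/cosmos-sdk/x/upgrade/keeper"
		upgradetypes "github.com/cosmos/cosmos-sdk/x/upgrade/types"
	)

	// registerDrainFirstUpgrade registers an upgrade handler that drains the
	// swingset queues before the usual module migrations run. drainKernel
	// stands in for whatever mechanism ends up asking the JS side to run to
	// empty, and runMigrations stands in for the usual app.mm.RunMigrations
	// call.
	func registerDrainFirstUpgrade(
		uk upgradekeeper.Keeper,
		upgradeName string,
		drainKernel func(ctx sdk.Context) error,
		runMigrations func(ctx sdk.Context, fromVM module.VersionMap) (module.VersionMap, error),
	) {
		uk.SetUpgradeHandler(upgradeName, func(ctx sdk.Context, plan upgradetypes.Plan, fromVM module.VersionMap) (module.VersionMap, error) {
			// Hypothetical ordering: run the kernel to empty before any other
			// actions that are part of the upgrade plan execute.
			if err := drainKernel(ctx); err != nil {
				return nil, err
			}
			return runMigrations(ctx, fromVM)
		})
	}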

@michaelfig (Member) commented:

> does cosmos-sdk provide support for knowing when the next upgrade height will be

There is an API for that:

	if plan, hasPlan := app.UpgradeKeeper.GetUpgradePlan(ctx); hasPlan {
		fmt.Println("will execute upgrade", plan.Name, "at", plan.Height)
	}

@mhofman (Member) commented Aug 2, 2023

Oh I missed that! For some reason I thought that only returned the plan if we were at the upgrade height already.

Regardless, most of what I said earlier stands:

  • We need the swingset module to poll the upgrade keeper for an existing plan during every BeginBlock.
  • We need to include some predefined message in the upgrade plan instructing swingset to change params at a certain height.
  • If the polling finds a plan containing such update-params messages, we compare the current height with the height in each message and apply the matching params updates (see the sketch after this list).
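
A rough sketch tying those three steps together; the paramUpdate message shape and its placement in the plan's Info field are assumptions that would need to be agreed on:

	package swingset

	import (
		"encoding/json"

		sdk "github.com/cosmos/cosmos-sdk/types"
		upgradekeeper "github.com/cosmos/cosmos-sdk/x/upgrade/keeper"
	)

	// paramUpdate is a hypothetical message embedded in the upgrade plan's
	// Info field: apply the given swingset params once Height is reached.
	type paramUpdate struct {
		Height int64           `json:"height"`
		Params json.RawMessage `json:"params"`
	}

	// planInfo is a hypothetical wrapper for the JSON carried in plan.Info.
	type planInfo struct {
		SwingsetParamUpdates []paramUpdate `json:"swingsetParamUpdates"`
	}

	// pendingParamUpdates runs every BeginBlock: poll the upgrade keeper for
	// a plan, decode its Info, and return the updates whose height matches
	// the current block so the caller can apply them.
	func pendingParamUpdates(ctx sdk.Context, uk upgradekeeper.Keeper) []paramUpdate {
		plan, hasPlan := uk.GetUpgradePlan(ctx)
		if !hasPlan || plan.Info == "" {
			return nil
		}
		var info planInfo
		if err := json.Unmarshal([]byte(plan.Info), &info); err != nil {
			return nil // Info is not in the hypothetical format; nothing to do.
		}
		var due []paramUpdate
		for _, u := range info.SwingsetParamUpdates {
			if u.Height == ctx.BlockHeight() {
				due = append(due, u)
			}
		}
		return due
	}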

Does that sound right, @michaelfig?
