allow pre-upgrade-block cosmos upgrade handler to drain swingset queue #6263

Open
warner opened this issue Sep 19, 2022 · 10 comments
Labels: cosmic-swingset (package: cosmic-swingset), enhancement (New feature or request), vaults_triage (DO NOT USE)

@warner (Member) commented Sep 19, 2022

What is the Problem Being Solved?

We expect to have at least one chain upgrade that discards all the swingset state and resumes from a brand new kernel/bootstrap, with no durable state outside of the cosmos IAVL tree. (Hopefully this highly traumatic upgrade will get us to a code base that can be upgraded in smaller pieces later, so we never need to do more than one such upgrade.) To limit the damage, the PSM-only ("pismo") release generally avoids deploying contracts or components (e.g. Zoe) whose critical state cannot be recovered from the cosmos IAVL tree.

When the governance proposal to perform this upgrade happens, the swingset kernel run-queue (or the new "inbound queue") might not be empty. That could cause problems with e.g. partially-committed vbank writes. In the worst case, that could violate an invariant with token balances for trades that were in-flight at the time the swingset state was abandoned.

We think it would be better if the last pre-upgrade block could drain the kernel run-queue completely.

Description of the Design

Governance proposals can set a flag to be examined at certain block heights. If the running binary does not have this flag, cosmos will halt at that point, to allow a replacement version to take over. That new version will have the flag, so the same check in the new version will not halt.

@arirubinstein said that there's already some support for doing things in the next-to-last block: some Go code can execute to prepare for the halt. I don't know the details.

We'd want to add a new ActionType (messages sent from cosmos to the cosmic-swingset JS side). We already have an END_BLOCK which causes the kernel to be cycled until the runPolicy reaches the computron/bean limit. The new action would cycle the kernel without a runPolicy constraint (just like it does for the bootstrap block, which can take as long as it needs to completely finish bootstrap). The Go code that runs pre-halt would want to use this action and let it complete fully, before halting. That way we can be sure there's nothing left in the kernel run-queue, increasing the coherence of the saved state.
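
A rough sketch of the Go side of that idea, where the action type name ("SWINGSET_DRAIN"), the struct shape, and the send helper are all hypothetical stand-ins rather than existing cosmic-swingset APIs:

	package swingsetdrain

	import (
		"fmt"
		"time"
	)

	// blockAction mirrors the general shape of block actions delivered to the
	// JS side; the field names here are illustrative only.
	type blockAction struct {
		Type        string `json:"type"`
		BlockHeight int64  `json:"blockHeight"`
		BlockTime   int64  `json:"blockTime"`
	}

	// sendToController stands in for whatever bridge already delivers
	// BEGIN_BLOCK/END_BLOCK actions to the JS side.
	type sendToController func(action blockAction) (string, error)

	// drainBeforeHalt sends the proposed unconstrained-run action and blocks
	// until the JS side reports that the kernel run-queue is empty.
	func drainBeforeHalt(send sendToController, height int64, blockTime time.Time) error {
		action := blockAction{
			Type:        "SWINGSET_DRAIN", // hypothetical new ActionType
			BlockHeight: height,
			BlockTime:   blockTime.Unix(),
		}
		if _, err := send(action); err != nil {
			return fmt.Errorf("drain before halt failed: %w", err)
		}
		return nil
	}

The real version would reuse whatever bridge already delivers BEGIN_BLOCK/END_BLOCK actions, and would block until the JS side confirms the run-queue is empty, before the node halts.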

This would probably also drain the "inbound queue", so e.g. all PSM trades that were waiting to get into the kernel would be executed. If that queue is very full, this might extend the final pre-halt block by quite a long time. We might want to consider adding code that deletes all these pending PSM trades (smart-wallet instructions) from the inbound queue just before we attempt to drain the kernel run-queue, cancelling those requests rather than extending the execution too long. The inbound-queue may contain other messages from Go (IBC events?), and we should decide whether we want to selectively delete only wallet instructions, or just delete the entire inbound queue.

Upgrade/Release Considerations

We don't think this change needs to go into the first PSM release (i.e. it is not a blocker for psmo-a). By spending an extra governance proposal / upgrade cycle, we could develop the feature later, and perform an upgrade from psmo-a to psmo-b which leaves the kernel+contracts intact and only adds this ability to the Go and cosmic-swingset components. Then a later upgrade from psmo-b to something more complicated would drain the swingset queue as it exits psmo-b, making it safer for the upgrade to clobber the swingset state upon entry to whatever comes next.

Security Considerations

If the kernel run-queue or the inbound-queue is very full, this drain operation could take a long time, proportional to the amount of work in that queue. That might be an availability problem.

Test Plan

Not sure; we don't have great test coverage in cosmic-swingset.

warner added the enhancement (New feature or request) and cosmic-swingset (package: cosmic-swingset) labels on Sep 19, 2022
@warner (Member, Author) commented Sep 20, 2022

We should probably send a BOYD (bringOutYourDead) delivery to every vat at the same time, to flush the virtual-object caches (otherwise a random amount of virtual-object state changes will still be sitting in the liveslots cache, and won't make it out to kernel-side DB state). Our vat-upgrade process (vatAdminService.upgrade()) already does this, but the kind of upgrade we're talking about here won't use that.

Draining the swingset queue can be done from the cosmic-swingset JS code, without making changes to the kernel. But swingset doesn't currently provide an external API for sending BOYD to vats, so we'll need to make kernel changes to enable this. Maybe a controller.flush()?

@mhofman (Member) commented Nov 9, 2022

@michaelfig rightly pointed out this may be doomed if we have any kind of infinite message loop in our system.

@mhofman (Member) commented Jan 24, 2023

I am not sure this should be included in the vaults release. We don't currently have a mechanism to flow upgrade details into the JS side, which would be required for this.

ivanlei added this to the Vaults RC0 milestone on Feb 1, 2023
@mhofman (Member) commented Feb 1, 2023

I believe the conclusion from the last kernel meeting was that this isn't needed.

@JimLarson (Contributor) commented:

Deferred per 2023-01-25 kernel meeting discussion.

rowgraus removed this from the Vaults Functional Testing milestone on Feb 13, 2023
@mhofman (Member) commented Jun 29, 2023

> @arirubinstein said that there's already some support for doing things in the next-to-last block: some Go code can execute to prepare for the halt. I don't know the details.

I don't believe there is such a capability in the cosmos-sdk.

My understanding of this issue is that for planned upgrades, we want to wind down the Swingset run queue in the software running prior to the upgrade, so that when the upgrade height is reached, the state won't impede any potential work we may want to perform during the upgrade.

cosmos-sdk's upgrade module only triggers at BeginBlock, when the upgrade height or time is reached. If the running chain software cannot handle the named upgrade, it panics (the prior version's behavior). If it can, and it discovers that some modules now have a different version, it runs their migrations (the upgraded version's behavior).

What we could potentially do is "simulate" that check: query the upgrade module's keeper to see whether an upgrade plan exists, and whether that plan is scheduled for block + 1 / time + x. However, this means mostly duplicating the logic of the upgrade module's BeginBlocker, which feels brittle.
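
A minimal sketch of that "simulate" check, assuming the swingset module's BeginBlock is given access to the upgrade keeper; GetUpgradePlan is the real cosmos-sdk API (shown again by @michaelfig below), while the one-block lookahead and the height-only handling are the simplifications that make this brittle:

	package swingset

	import (
		sdk "github.com/cosmos/cosmos-sdk/types"
		upgradekeeper "github.com/cosmos/cosmos-sdk/x/upgrade/keeper"
	)

	// shouldDrainThisBlock reports whether an upgrade plan is scheduled for
	// the next block, i.e. whether the current block is the last one this
	// binary will execute and should therefore drain the swingset queues.
	func shouldDrainThisBlock(ctx sdk.Context, uk upgradekeeper.Keeper) (bool, string) {
		plan, hasPlan := uk.GetUpgradePlan(ctx)
		if !hasPlan {
			return false, ""
		}
		// Height-based plans only; time-based plans would need the analogous
		// block-time comparison, which is part of what makes duplicating the
		// upgrade module's BeginBlocker logic brittle.
		if plan.Height == ctx.BlockHeight()+1 {
			return true, plan.Name
		}
		return false, ""
	}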

An alternative would be to simply drain the queue with the new software, before executing any swingset-upgrade-specific operations. This carries different risks.

@warner (Member, Author) commented Aug 1, 2023

This came up again in today's leads meeting. Draining would (somewhat) reduce the risk of vat upgrades that are bundled into chain-software upgrades, by preventing pre-existing run-queue actions from executing in the middle of a contract/zoe/wallet upgrade. The risk is objectively pretty low, since statistically most of our blocks are empty, but still.

@michaelfig / @JimLarson can you shed any light on the "does cosmos-sdk provide support for knowing when the next upgrade height will be" question?

@mhofman (Member) commented Aug 2, 2023

Chain software upgrades are handled by the upgrade module, and the upgrade mechanism is described here: https://github.com/agoric-labs/cosmos-sdk/blob/v0.45.11-alpha.agoric.3/docs/core/upgrade.md

I suppose the goal is to drain the swingset run queue and/or prevent new inbound actions from being processed sometime before an upgrade?

I believe there is currently no upgrade-supported way to trigger anything ahead of the upgrade height. However, each module could implement a scheduled param change on its own through the BeginBlock callback (that's actually how the upgrade module works). I don't know if there is any built-in or third-party module that facilitates scheduling module parameter changes at a specific height. Anyway, we likely could add support for this ourselves, and include such a scheduled param change in the governance proposal for the upgrade.
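
A hedged sketch of what such a self-scheduled param change might look like in the swingset module's BeginBlock; the scheduledParamChange record and wherever it would be stored are assumptions, since nothing like this exists today:

	package swingset

	import (
		sdk "github.com/cosmos/cosmos-sdk/types"
	)

	// scheduledParamChange is a hypothetical record written by a governance
	// proposal: apply the given params once the chain reaches Height.
	type scheduledParamChange struct {
		Height int64
		Apply  func(ctx sdk.Context) // closure over the new param values
	}

	// applyScheduledParamChanges would run in the module's BeginBlock,
	// mirroring how the upgrade module itself checks its plan height on
	// every block. It applies any due changes and returns the rest.
	func applyScheduledParamChanges(ctx sdk.Context, pending []scheduledParamChange) []scheduledParamChange {
		var remaining []scheduledParamChange
		for _, change := range pending {
			if ctx.BlockHeight() >= change.Height {
				change.Apply(ctx)
			} else {
				remaining = append(remaining, change)
			}
		}
		return remaining
	}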

Also, we likely could arrange the upgrade handling in cosmic-swingset to attempt draining the kernel run-queue before running the actions that are part of the upgrade plan. However, there is always the risk that the current run is in a ping-pong loop, at least until we implement #7938 (or a more advanced scheduler). I discussed this with @michaelfig the other day, and we decided it was OK to let things interleave for now.
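
For that alternative, the hook point would presumably be the upgrade handler registered for the plan's name; in this sketch SetUpgradeHandler is the real cosmos-sdk API, while drainKernel, runMigrations, and the drain-before-migrations ordering are hypothetical stand-ins:

	package app

	import (
		sdk "github.com/cosmos/cosmos-sdk/types"
		"github.com/cosmos/cosmos-sdk/types/module"
		upgradekeeper "github.com/cosmos/cosmos-sdk/x/upgrade/keeper"
		upgradetypes "github.com/cosmos/cosmos-sdk/x/upgrade/types"
	)

	// registerDrainFirstUpgrade registers an upgrade handler that drains the
	// swingset queues before the usual module migrations run. drainKernel
	// stands in for whatever mechanism ends up asking the JS side to run to
	// empty, and runMigrations stands in for the usual app.mm.RunMigrations
	// call.
	func registerDrainFirstUpgrade(
		uk upgradekeeper.Keeper,
		upgradeName string,
		drainKernel func(ctx sdk.Context) error,
		runMigrations func(ctx sdk.Context, fromVM module.VersionMap) (module.VersionMap, error),
	) {
		uk.SetUpgradeHandler(upgradeName, func(ctx sdk.Context, plan upgradetypes.Plan, fromVM module.VersionMap) (module.VersionMap, error) {
			// Hypothetical ordering: run the kernel to empty before any other
			// actions that are part of the upgrade plan execute.
			if err := drainKernel(ctx); err != nil {
				return nil, err
			}
			return runMigrations(ctx, fromVM)
		})
	}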

@michaelfig (Member) commented:

> does cosmos-sdk provide support for knowing when the next upgrade height will be

There is an API for that:

	if plan, hasPlan := app.UpgradeKeeper.GetUpgradePlan(ctx); hasPlan {
		fmt.Println("will execute upgrade", plan.Name, "at", plan.Height)
	}

@mhofman (Member) commented Aug 2, 2023

Oh I missed that! For some reason I thought that only returned the plan if we were at the upgrade height already.

Regardless, most of what I said earlier stands:

  • We need the swingset module to poll the upgrade keeper for an existing plan during every BeginBlock.
  • We need to include some predefined message in the upgrade plan instructing swingset to change params at a certain height.
  • If the polling finds a plan containing such update-params messages, we compare the current height with the height in each message and apply the matching params updates (see the sketch after this list).
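
A rough sketch tying those three steps together; the paramUpdate message shape and its placement in the plan's Info field are assumptions that would need to be agreed on:

	package swingset

	import (
		"encoding/json"

		sdk "github.com/cosmos/cosmos-sdk/types"
		upgradekeeper "github.com/cosmos/cosmos-sdk/x/upgrade/keeper"
	)

	// paramUpdate is a hypothetical message embedded in the upgrade plan's
	// Info field: apply the given swingset params once Height is reached.
	type paramUpdate struct {
		Height int64           `json:"height"`
		Params json.RawMessage `json:"params"`
	}

	// planInfo is a hypothetical wrapper for the JSON carried in plan.Info.
	type planInfo struct {
		SwingsetParamUpdates []paramUpdate `json:"swingsetParamUpdates"`
	}

	// pendingParamUpdates runs every BeginBlock: poll the upgrade keeper for
	// a plan, decode its Info, and return the updates whose height matches
	// the current block so the caller can apply them.
	func pendingParamUpdates(ctx sdk.Context, uk upgradekeeper.Keeper) []paramUpdate {
		plan, hasPlan := uk.GetUpgradePlan(ctx)
		if !hasPlan || plan.Info == "" {
			return nil
		}
		var info planInfo
		if err := json.Unmarshal([]byte(plan.Info), &info); err != nil {
			return nil // Info is not in the hypothetical format; nothing to do.
		}
		var due []paramUpdate
		for _, u := range info.SwingsetParamUpdates {
			if u.Height == ctx.BlockHeight() {
				due = append(due, u)
			}
		}
		return due
	}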

Does that sound right, @michaelfig?
