# allow pre-upgrade-block cosmos upgrade handler to drain swingset queue #6263
We should probably send a BOYD (`bringOutYourDead`) to every vat at the same time, to flush the virtual-object caches (otherwise a random amount of virtual-object state changes will still be sitting in the liveslots cache, and won't make it out to kernel-side DB state). Our vat-upgrade process (…)

Draining the swingset queue can be done from the cosmic-swingset JS code, without making changes to the kernel. But swingset doesn't currently provide an external API for sending BOYD to vats, so we'll need to make kernel changes to enable this. Maybe a …
@michaelfig rightly pointed out this may be doomed if we have any kind of infinite message loop in our system.
I am not sure this should be included in the vaults release. We don't currently have a mechanism to flow upgrade details into the JS side, which would be required for this.
I believe the conclusion in the last kernel meeting was that this wasn't needed.
Deferred per 2023-01-25 kernel meeting discussion.
I don't believe there is such a capability in the cosmos-sdk. My understanding of this issue is that for planned upgrades, we want to wind down the Swingset run queue in the software running prior to the upgrade, so that when the upgrade height is reached, the state won't impede any potential work we may want to perform during the upgrade.

cosmos-sdk's upgrade module only triggers at begin-block when the upgrade height or time is reached. If the chain software cannot handle the upgraded version, it panics (prior-version behavior). When it discovers that some modules now have a different version, it runs migrations (upgraded-version behavior).

What we could potentially do is "simulate": query the upgrade module's keeper for whether an upgrade plan exists, and check if that plan is scheduled for block + 1 / time + x. However, this means mostly duplicating the logic of the upgrade module's …

An alternative would be to simply drain the queue with the new software before executing any swingset-upgrade-specific operations. That carries different risks.
This came up again in today's leads meeting. It would (somewhat) reduce the risk of vat upgrades that are bundled into chain-software upgrades, by preventing pre-existing run-queue actions from executing in the middle of a contract/zoe/wallet upgrade. The risk is objectively pretty low, since statistically most of our blocks are empty, but still. @michaelfig / @JimLarson can you shed any light on the "does cosmos-sdk provide support for knowing when the next upgrade height will be" question?
Chain software upgrades are handled by the … I suppose the goal is to drain the swingset run queue and/or prevent new inbound actions from being processed sometime before an upgrade? I believe there is currently no …

Also, we likely could arrange the upgrade handling in cosmic-swingset to attempt draining the kernel run-queue before running the actions part of the upgrade plan. However, there is always the risk that the current run is in a ping-pong loop, at least until we implement #7938 (or a more advanced scheduler). I discussed this with @michaelfig the other day, and we decided it was OK to let things interleave for now.
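One way to bound that ping-pong risk is to cap the number of cranks the drain will attempt. A minimal Go sketch, against a hypothetical controller interface (neither `kernelController` nor `Step` exists today; actual kernel cycling lives on the JS side):

```go
package app

// kernelController is an illustrative stand-in for however
// cosmic-swingset might expose kernel cycling to Go; it is not an
// existing API.
type kernelController interface {
	// Step runs one kernel crank; didWork is false once the
	// run-queue is empty.
	Step() (didWork bool, err error)
}

// drainWithCap tries to empty the kernel run-queue but gives up
// after maxCranks, so a message ping-pong loop cannot stall the
// upgrade forever. It reports whether the queue was fully drained.
func drainWithCap(c kernelController, maxCranks int) (bool, error) {
	for i := 0; i < maxCranks; i++ {
		didWork, err := c.Step()
		if err != nil {
			return false, err
		}
		if !didWork {
			return true, nil // queue fully drained
		}
	}
	return false, nil // hit the cap; queue may still hold work
}
```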
There is an API for that:

```go
if plan, hasPlan := app.UpgradeKeeper.GetUpgradePlan(ctx); hasPlan {
	fmt.Println("will execute upgrade", plan.Name, "at", plan.Height)
}
```
Oh, I missed that! For some reason I thought that only returned the plan if we were already at the upgrade height. Regardless, most of what I said earlier stands:

Does that sound right, @michaelfig?
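For reference, a minimal sketch of the "is the upgrade scheduled for the very next block" check, built on the `GetUpgradePlan` API shown above (the function name and placement are illustrative):

```go
package app

import (
	sdk "github.com/cosmos/cosmos-sdk/types"
	upgradekeeper "github.com/cosmos/cosmos-sdk/x/upgrade/keeper"
)

// upgradeIsNextBlock reports whether an upgrade plan is scheduled
// for the block immediately after the current one, i.e. the
// "simulate" check discussed earlier. This only covers height-based
// plans; time-based plans would need a separate comparison.
func upgradeIsNextBlock(ctx sdk.Context, k upgradekeeper.Keeper) bool {
	plan, hasPlan := k.GetUpgradePlan(ctx)
	return hasPlan && plan.Height == ctx.BlockHeight()+1
}
```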
## What is the Problem Being Solved?
We expect to have at least one chain upgrade that discards all the swingset state and resumes from a brand new kernel/bootstrap, with no durable state outside of the cosmos IAVL tree. (Hopefully this highly traumatic upgrade will get us to a code base that can be upgraded in smaller pieces later, so we never need to do more than one such upgrade.) The `pso` or "pismo" release is, in general, avoiding deployment of contracts or components (zoe) whose critical state cannot be recovered from the cosmos IAVL tree, to limit the damage.

When the governance proposal to perform this upgrade happens, the swingset kernel run-queue (or the new "inbound queue") might not be empty. That could cause problems with e.g. partially-committed `vbank` writes. In the worst case, that could violate an invariant with token balances for trades that were in-flight at the time the swingset state was abandoned.

We think it would be better if the last pre-upgrade block could drain the kernel run-queue completely.
## Description of the Design
Governance proposals can set a flag to be examined at certain block heights. If the running binary does not have this flag, cosmos will halt at that point, to allow a replacement version to take over. That new version will have the flag, so the same check in the new version will not halt.
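In current cosmos-sdk, that "flag" corresponds to the upgrade plan name: the old binary halts at the plan height because it has no registered handler for that name, while the new binary registers one and proceeds. A sketch of the standard x/upgrade pattern (the `psmo-b` plan name and the `App` field names are illustrative, not our actual wiring):

```go
package app

import (
	sdk "github.com/cosmos/cosmos-sdk/types"
	"github.com/cosmos/cosmos-sdk/types/module"
	upgradekeeper "github.com/cosmos/cosmos-sdk/x/upgrade/keeper"
	upgradetypes "github.com/cosmos/cosmos-sdk/x/upgrade/types"
)

// App holds the pieces the handler needs; field names are illustrative.
type App struct {
	UpgradeKeeper upgradekeeper.Keeper
	mm            *module.Manager
	configurator  module.Configurator
}

func (app *App) registerPsmoBUpgradeHandler() {
	app.UpgradeKeeper.SetUpgradeHandler(
		"psmo-b", // must match the plan name in the governance proposal
		func(ctx sdk.Context, plan upgradetypes.Plan, fromVM module.VersionMap) (module.VersionMap, error) {
			// Module migrations (and any swingset-specific upgrade
			// work) run here, at the upgrade height, in the new binary.
			return app.mm.RunMigrations(ctx, app.configurator, fromVM)
		},
	)
}
```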
@arirubinstein said that there's already some support for doing things in the next-to-last block: some Go code can execute to prepare for the halt. I don't know the details.
We'd want to add a new ActionType (messages sent from cosmos to the cosmic-swingset JS side). We already have an `END_BLOCK` action, which causes the kernel to be cycled until the `runPolicy` reaches the computron/bean limit. The new action would cycle the kernel without a `runPolicy` constraint (just like it does for the bootstrap block, which can take as long as it needs to completely finish bootstrap). The Go code that runs pre-halt would want to use this action and let it complete fully, before halting. That way we can be sure there's nothing left in the kernel run-queue, increasing the coherence of the saved state.

This would probably also drain the "inbound queue", so e.g. all PSM trades that were waiting to get into the kernel would be executed. If that queue is very full, this might extend the final pre-halt block by quite a long time. We might want to consider adding code that deletes all pending PSM trades (smart-wallet instructions) from the inbound queue just before we attempt to drain the kernel run-queue, cancelling those requests rather than extending execution too long. The inbound queue may contain other messages from Go (IBC events?), and we should decide whether we want to selectively delete only wallet instructions, or just delete the entire inbound queue.
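A rough Go sketch of both pieces. Everything here is an assumption rather than existing API: the `DRAIN_QUEUE` action type is the new ActionType proposed above, `sendToController` stands in for however the Go side delivers actions to JS, and the inbound-queue representation and the `WALLET_ACTION` / `WALLET_SPEND_ACTION` strings are illustrative:

```go
package app

import (
	"encoding/json"

	sdk "github.com/cosmos/cosmos-sdk/types"
)

// drainQueueAction mirrors the shape of existing cosmic-swingset
// actions such as END_BLOCK, but would ask the JS side to cycle the
// kernel with no runPolicy limit. "DRAIN_QUEUE" does not exist yet.
type drainQueueAction struct {
	Type        string `json:"type"`
	BlockHeight int64  `json:"blockHeight"`
	BlockTime   int64  `json:"blockTime"`
}

// preHaltDrain sends the unlimited-run action and blocks until the
// JS side reports the kernel run-queue is empty.
func preHaltDrain(ctx sdk.Context, sendToController func(needReply bool, msg string) (string, error)) error {
	bz, err := json.Marshal(drainQueueAction{
		Type:        "DRAIN_QUEUE",
		BlockHeight: ctx.BlockHeight(),
		BlockTime:   ctx.BlockTime().Unix(),
	})
	if err != nil {
		return err
	}
	_, err = sendToController(true, string(bz))
	return err
}

// inboundItem and filterInboundQueue sketch the optional step of
// dropping pending smart-wallet instructions before the drain, so a
// very full inbound queue cannot stretch the final block.
type inboundItem struct {
	ActionType string
}

func filterInboundQueue(queue []inboundItem) []inboundItem {
	kept := queue[:0] // reuse the backing array; mutates queue
	for _, item := range queue {
		switch item.ActionType {
		case "WALLET_ACTION", "WALLET_SPEND_ACTION":
			// cancel pending wallet instructions rather than run them
		default:
			kept = append(kept, item) // keep e.g. IBC events
		}
	}
	return kept
}
```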
## Upgrade/Release Considerations
We don't think this change needs to go into the first PSO release (i.e. it is not a blocker for `psmo-a`). By spending an extra governance proposal / upgrade cycle, we could develop the feature later, and perform an upgrade from `psmo-a` to `psmo-b` which leaves the kernel+contracts intact and only adds this ability to the Go and cosmic-swingset components. Then later, an upgrade from `psmo-b` to something more complicated would drain the swingset queue as it exits version `psmo-b`, making it safer for the upgrade to clobber the swingset state upon entry to whatever comes next.

## Security Considerations
If the kernel run-queue or the inbound-queue is very full, this drain operation could take a long time, proportional to the amount of work in that queue. That might be an availability problem.
## Test Plan
Not sure; we don't have great test coverage in cosmic-swingset.