how to change XS on a deployed system #6361
We definitely need to figure out how to handle multiple versions of vat-related code: not just xsnap, but the supervisor and liveslots, and probably the kernel-side adapter that talks to xsnap (vat translators?).
(comment removed; apparently I had some sort of cut-and-paste error while editing, and wound up with two versions of the same comment: the one below is the right version)
Let's see, there's an interface boundary (at one layer of the abstraction/protocol/technology stack) between the kernel and the worker: e.g. how does it encode syscalls into netstrings, how does the kernel ask it to perform a heap snapshot, that sort of thing. We could imagine changing that protocol (e.g. change from "please write a heap snapshot to filename XYZ" to "please write a heap snapshot on previously-established file descriptor 6" a la #6363), by changing the …

There's another boundary (higher up the stack) between the kernel and the liveslots that lives in the worker, which talks about how syscalls are expressed, and marshalling. If it didn't threaten (or we didn't care about) determinism, say in an ag-solo, then we could e.g. switch to "smallcaps" (#6326), or change …

Given that we do care about determinism, there's effectively a boundary between the kernel and the persistent vat image that lives in the worker, the one that includes userspace, …

So our initial kernel version K1 will only be able to communicate "vat image protocol" P1, and we'll deploy some set of vats that speak P1. "P1" consists of all the protocol- or compatibility-relevant platform layers: liveslots and supervisor, for sure. The particular version of e.g. SES being used on the worker isn't really part of P1, even though you can't change it without doing a vat upgrade, because the kernel doesn't care. But the behavior of …

Let's imagine one package that contains the kernel (at some particular SemVer version, nicknamed K1), and a separate package "worker" to hold all the code that goes into the worker (so it would import supervisor, lockdown/Endo, liveslots, plus …).

Later, we define the "P2" protocol, with some enhancements/changes to P1. We release a worker W2 that speaks P2. We change the kernel package to speak both P1 and P2, make it depend upon W2 instead of W1, and then we release kernel K2.
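As an aside on the kernel↔worker boundary mentioned above: messages on that pipe are framed as netstrings. Here is a minimal framing sketch (not the actual xsnap wire code; it assumes an ASCII payload for simplicity) — a netstring wraps a payload as `<decimal byte length>:<payload>,`:

```javascript
// Minimal netstring framing sketch (NOT the real xsnap implementation).
// A netstring is "<decimal byte length>:<payload>," -- e.g. "5:hello,".
// For simplicity this assumes single-byte (ASCII) payload characters.
function encodeNetstring(payload) {
  const len = Buffer.from(payload, 'utf8').length;
  return `${len}:${payload},`;
}

function decodeNetstring(str) {
  const colon = str.indexOf(':');
  if (colon === -1) throw new Error('netstring: missing length prefix');
  const length = Number(str.slice(0, colon));
  const payload = str.slice(colon + 1, colon + 1 + length);
  if (str[colon + 1 + length] !== ',') {
    throw new Error('netstring: missing "," terminator');
  }
  return payload;
}
```

Either side of the protocol change discussed above, this framing layer stays the same; what changes is the content of the payloads.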
When an existing deployment is upgraded to K2, it bundles the worker code into a second DB key for W2, but all existing vats keep using W1 from the DB. Any new vats, or any vat upgrades, will start using W2 (and the kernel will speak to them with P2 instead of P1). The kernel retains W1 in the DB until the last vat is upgraded, at which point the refcount drops to zero and it can be deleted (maybe). (We'd have one refcount held on W2 by virtue of it being the default for new vats.) A brand-new deployment of K2 will bundle W2 and use it for all vats, and won't have a copy of W1 in its DB. That kinda implies that the bundled/stored W1 also includes the binary for …

Another way to think about it is that the supervisor/liveslots/etc bundles are part of the worker/… But in any case, a kernel deployment that has older vat images needs to retain the ability to use the W1 …

We can imagine small changes to W1 that would be handled in different ways:
But for anything that 1: cannot start from a heap snapshot, 2: causes GC divergence, or 3: is not compatible with the "P1" protocol, a deployed kernel must either upgrade all vats across the K1/K2 boundary, or must keep a copy of the old worker and bundles until all vats have been upgraded to something newer.

Node.js is not excited about a single package having dependencies on multiple versions of the same package: it can handle dependency graphs like (A->B@v1, A->C, A->B@v2), but neither the …
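The retain-until-refcount-zero bookkeeping described in the comment above could be sketched like this (hypothetical names, not the real kernel DB schema):

```javascript
// Sketch of refcounted worker bundles in the kernel DB (hypothetical schema).
// Each vat incarnation holds a reference on the worker version it runs under;
// a worker bundle is only deletable once no incarnation (and no "default for
// new vats" slot) refers to it.
class WorkerStore {
  constructor() {
    this.bundles = new Map(); // workerId -> { bundle, refCount }
  }
  addWorker(workerId, bundle) {
    this.bundles.set(workerId, { bundle, refCount: 0 });
  }
  acquire(workerId) {
    this.bundles.get(workerId).refCount += 1;
  }
  release(workerId) {
    const entry = this.bundles.get(workerId);
    entry.refCount -= 1;
    if (entry.refCount === 0) this.bundles.delete(workerId); // the "maybe delete"
  }
  has(workerId) {
    return this.bundles.has(workerId);
  }
}
```

In these terms, a K2 deployment would `addWorker('W2', …)` and `acquire('W2')` once as the default for new vats; each vat upgrade does `release('W1')` plus `acquire('W2')`, and W1 disappears when its last vat moves off it.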
#6596 is about defining a stable package to encapsulate the worker behavior, including XS. I think that covers most of this issue, although it's more like "how to maintain a stable XS on a deployed system despite changes to the kernel or other software". Once we have that stable package, and the kernel package pins a specific version of the worker (no …

The first answer is for small changes, which are capable of starting from heap snapshots left by earlier versions. We'll express these with a worker version like …

The second answer is for XS changes that cannot accommodate an earlier snapshot (expressed by an entirely new worker package, e.g. …).
We've backed away from versioned worker packages. Instead, for now, our rule is that we aren't allowed to make incompatible changes to XS or xsnap. We can (carefully) make compatible changes, if we can convince ourselves that: …
If we believe that we've nailed down the last of the GC syscall sensitivity, then that allows us to make XS changes which affect GC timing and metering (as long as we don't exceed the hard per-delivery computron limit or the memory-allocation limit in either the original run or the replay).

To enable us to make more significant changes, we must first implement a scheme to simultaneously run two different versions of xsnap at the same time. We don't need a full …

I'm sufficiently confident that we can implement this later, and that we can upgrade the kernel (and other parts of agoric-sdk) without changing XS or xsnap. So I'm moving this ticket out of the Vaults milestone.
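The "compatible change" criterion above — neither the original run nor the replay may exceed the hard per-delivery computron or memory-allocation limits — amounts to a check like this sketch (the field names and limit values are hypothetical, not the real kernel constants):

```javascript
// Sketch: an XS change that alters metering is only "compatible" if every
// delivery stays under the hard per-delivery limits under BOTH engines.
// Limit values below are illustrative placeholders, not real kernel config.
const COMPUTRON_LIMIT = 100_000_000;
const ALLOCATION_LIMIT = 2_000_000_000; // bytes

function replayIsSafe(deliveries) {
  // Each entry records meter usage for one delivery under the old engine
  // (from the transcript) and the new engine (from a trial replay).
  return deliveries.every(
    ({ oldMeter, newMeter }) =>
      oldMeter.computrons <= COMPUTRON_LIMIT &&
      newMeter.computrons <= COMPUTRON_LIMIT &&
      oldMeter.allocate <= ALLOCATION_LIMIT &&
      newMeter.allocate <= ALLOCATION_LIMIT,
  );
}
```

A delivery whose metering drifts past the limit under only one of the two engines is exactly the divergence hazard: half the validators would kill the vat and half would not.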
I've created a new issue to document an alternative to the multi-worker approach for incompatible XS worker updates. I believe it makes it possible to only ever have a single version of XS, and to avoid traumatic vat upgrades when updating XS (at the cost of potentially longer chain upgrades).
What is the Problem Being Solved?
How do we deploy a fixed/improved `xsnap` or XS engine once we've launched the chain? Or more generally, for any deployed swingset kernel (with at least one vat running, i.e. all of them), how do we take advantage of bugfixes, security fixes, performance improvements, or new features in the XS JavaScript engine or the `xsnap` program we use to host vat workers, while maintaining consistent/deterministic behavior among the members of a consensus group?

I've been assuming that the only way to do this safely will be to add a "which version of XS should we use?" field to each vat's stored `options`, where a missing value means "the first one", so every vat will keep using the same XS until the vat is upgraded. Then we'd do a binary upgrade to add a new version of XS (giving us two to choose from), then do a baggage-style upgrade of each vat, where the new incarnation uses the same vat bundle but the most recent XS version. That would provide fully-identical vat behavior independent of new XS versions up until the vat upgrade, and we can tolerate any amount of different XS/xsnap behavior because it only appears in the second incarnation (and all validators perform the vat upgrade at the same time).

@arirubinstein and @mhofman pointed out that 1: this is painful, and 2: might not be necessary. Their proposal, for relatively small XS changes, would be:
- perform a chain upgrade (replacing `agd` with a new version) at a specific block height, as we'd do for any changes to cosmos or the kernel
- when the new `agd` version starts, all vat workers will be restarted with the new `xsnap` and XS code

Of course, this approach only works if the new XS is sufficiently similar to the old one:
- replaying the transcript under the new engine must not produce `vatstoreGet` syscalls that do not exist in the transcript (nor do the results they want back), causing the replay to fail
- GC-timing differences are mostly tolerable, because each `dispatch.bringOutYourDead` forces GC, which basically "resets the clock", after which the timing difference doesn't matter

To test this thoroughly before performing an upgrade, we'd need to know the state of each vat at the point of upgrade. And since we schedule upgrades ahead of time (at a particular block height), we could not reliably predict that state, making such upgrades kinda risky.
The risk could be removed by including a kernel-wide `dispatch.bringOutYourDead` (to every vat) as the last thing done before the old version is shut down. #6263 is about draining the swingset queue entirely at this point, which might take a significant amount of time, but if we're only changing XS, the `vats.forEach(vat => vat.BOYD())` would probably be sufficient, and might complete in just a few seconds.

Larger XS Upgrades
If the XS/xsnap change cannot load a heap snapshot created by its predecessor, then we can't just swap out the xsnap binary. Instead, we must perform a vat upgrade, so we have a new incarnation of the vat which uses the second version (it could use different vat code too, or a different liveslots/supervisor/SES: any amount of change is ok as long as it can read the `vatstore` data of its predecessor, and can fulfill the obligations implied by its exported vrefs).

If we want to avoid having multiple versions of XS at the same time, we'd need the last act of the old `agd` execution to be a `stopVat` on all vats. Then the first act of the new `agd` could be a `startVat` (with the new engine) on all vats. No heap snapshots would be used across the boundary, and of course all workers are shut down when the application exits.

The downside is that normally a vat upgrade will roll back to the previous version if something goes wrong (i.e. if the new version fails to `defineKind` everything that the predecessor created, or throws an exception during `createRootObject` or contract setup). If we separate the version-N `stopVat` from the version-N+1 `startVat`, there's nothing to roll back to: the heap snapshot is unusable by the new XS image, and the old XS image is gone.

The alternative is the approach I originally sketched out: multiple XS versions, where each vat incarnation uses a specific one (which does not change, despite `agd` or kernel upgrades), and each vat upgrade changes the flags to start using the most recent one. The deployment process would require an `agd` upgrade first, to add the new version, then a bunch of vat upgrades to switch to it. That involves the governance committee for each contract, in addition to the chain governance and validator coordination necessary to change `agd`. It also requires some more creativity in our source control (maybe having multiple `packages/xsnap-v1/` / `packages/xsnap-v2/` subdirectories in a single repo? eww).

Or perhaps separating our `xsnap` out into a separate repository(?), so validator operators could do e.g. `yarn global add @agoric/xsnap@$VERSION` to make the new version available as `xsnap-v2`/etc., instead of a single version living inside the `agoric-sdk` source tree. In that case, we wouldn't require an `agd` upgrade to make `xsnap-v2` available; we'd just announce an intention to require `v2` at some point. All validators would have to pay attention to the announcement and install it promptly, otherwise when the contract governance vote comes through to upgrade the vat, their node would crash.
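The separate-repository idea above implies the kernel must resolve a vat's pinned version tag to an out-of-band-installed binary, and fail loudly for any validator who missed the announcement. A sketch with a hypothetical install layout (real deployments would presumably probe `$PATH`):

```javascript
// Sketch: map a vat incarnation's pinned xsnap version to an installed
// binary. Install paths are hypothetical, not a real Agoric convention.
const installedBinaries = new Map([
  ['v1', '/usr/local/bin/xsnap-v1'],
  // 'v2' appears here only after the operator installs it out of band.
]);

function xsnapBinaryFor(version) {
  const binPath = installedBinaries.get(version);
  if (binPath === undefined) {
    // A validator who ignored the upgrade announcement fails here, at the
    // moment the vat-upgrade vote lands and the new version is required.
    throw new Error(`xsnap-${version} is not installed on this validator`);
  }
  return binPath;
}
```

This makes the failure mode in the last paragraph concrete: the crash happens per-validator at vat-upgrade time, which is why the install announcement would need a comfortable lead time.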