swingset JS-engine heap snapshots #511

Closed
warner opened this issue Feb 6, 2020 · 59 comments
Labels
SwingSet package: SwingSet

Comments

@warner
Member

warner commented Feb 6, 2020

To restart a Swingset vat in a reasonable amount of time, we need a way to capture the full object graph of a vat (the "heap snapshot") and save it to disk, then later reload that snapshot into memory and pick up where we left off. We currently simulate this by recording all the inbound messages (calls to the dispatch object, like dispatch.deliver), which are pure data, and then replay them one at a time at restart time. This takes O(N) space and time, where N is the number of messages that have been sent.

We need engine support to create a snapshot that takes O(A) space and time, where A is the size of the active set: the objects reachable from the vat-side export tables. This is not something that can be done from pure Javascript. We probably need this snapshot to be deterministic.

We expect to extend XS to get this functionality: it's not too much different than the way XS starts up (running some "pre-init" code on the build host, then serializing the resulting object graph into tables that are written out to C code). XS currently reloads the snapshot with the help of a C compiler, so we'd need to change that part.
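To make the cost difference concrete, here is a minimal sketch in plain JavaScript. The toy vat, its `snapshot` method, and the JSON encoding are all hypothetical stand-ins (a real heap snapshot would need engine support, as noted above); the point is only that replay work scales with the transcript length N while restore work scales with the live state A.

```javascript
// Toy vat (hypothetical, not SwingSet code): state is a map of counters.
function makeVat(initial = {}) {
  const state = new Map(Object.entries(initial));
  return {
    dispatch: {
      deliver(key, amount) {
        state.set(key, (state.get(key) || 0) + amount);
      },
    },
    // Stand-in for an engine-level heap snapshot: captures live state only.
    snapshot: () => JSON.stringify(Object.fromEntries(state)),
  };
}

// Restart by replay: work grows with the number of recorded messages, O(N).
function restartByReplay(transcript) {
  const vat = makeVat();
  for (const [key, amount] of transcript) {
    vat.dispatch.deliver(key, amount);
  }
  return vat;
}

// Restart from a snapshot: work grows with the size of the live state, O(A).
function restartFromSnapshot(snapshotText) {
  return makeVat(JSON.parse(snapshotText));
}

const transcript = [];
const liveVat = makeVat();
for (let i = 0; i < 1000; i += 1) {
  const message = [i % 2 === 0 ? 'even' : 'odd', 1];
  transcript.push(message); // pure data, recorded as it arrives
  liveVat.dispatch.deliver(...message);
}
const savedState = liveVat.snapshot();
const replayedVat = restartByReplay(transcript);
const restoredVat = restartFromSnapshot(savedState);
```

Here the transcript holds 1000 entries while the saved state holds two counters, and both restart paths end in the same state.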

@warner warner added the SwingSet package: SwingSet label Feb 6, 2020
@zarutian
Contributor

This is not something that can be done from pure Javascript

Not easily, but it can be done, just not from inside the same JS environment.

@dckc
Member

dckc commented Feb 12, 2020

@phoddie any chance I missed an easier way to do this?

I see fxPrintHeap but it assumes that the->context is a txLinker*. It's called from main() in xsl.c:
https://github.com/Moddable-OpenSource/moddable/blob/public/xs/tools/xsl.c#L545

I don't see any way to reuse any of that code, so it looks like we'll have to copy and adapt it... all the code that prints gxPreparation and its parts (stack, heap, names, ...). It looks like about 50 lines of code in xsl.c and most of xslSlot.c, which weighs in at 1250 lines.

I had hoped to find some API that I could call to make a tiny demo of writing a snapshot to a file and reading it back in, but it looks like I need to adapt some ~1500 lines of code (or some working subset of it) to get off the ground.

Hmm... I see fxCloneMachine, which seems to walk the whole machine state. Maybe that would be another approach? It's more like 125 lines.

@phoddie

phoddie commented Feb 12, 2020

fxCloneMachine is a runtime operation that creates a new VM on top of a shared (frozen) VM. It is what we use to instantiate a VM from the frozen VM in ROM/flash. It isn't what you want.

The code in xsl.c creates the C source code to build that frozen VM. As you observed, it is not a trivial process. The good news is that it is complete and it works -- we use it every day.

However, what it does is not exactly a snapshot. We only ever use that VM as read-only data. I believe you want to continue executing that VM. In that case, there's some new work to instantiate the VM around it, rather than using fxCloneMachine.

@dckc
Member

dckc commented Feb 20, 2020

First, I was making this harder than it is. We don't require snapshot/restore of the xs machine; just a graph of objects, modulo some exits. @warner suggested: data = magicSnapshot(start, [exits]).

Next, @erights suggested ...

debuggers have omniscient access to data that is insulated from inadvertent side effects. I certainly don't want to traverse one JSON RPC at a time. But there's something receiving and acting on those messages. Maybe that's a place to stand to do a transitive complete side-effect-free snapshot of an object graph starting at a root?

"The port of xsbug defaults to 5002 by convention." according to XS Platforms docs.

So I fired up nc -l -p 5002 and launched moddable/examples/helloworld$ mcconfig -d -m. And lo, nc received and reported:

<xsbug><login name="mc" value="XS"/></xsbug>

Based on some poring over xsDebug.c, I uttered <select /> and voila, I got a dump of the globals and modules. I think locals are available, but none were in scope:

<xsbug>
  <global>
    <property flags="+cEw_" name="Array" value="@00007FEC26BBF8C0"/>
    <property flags="+cEw_" name="ArrayBuffer" value="@00007FEC26BC2300"/>
    <property flags="+cEw_" name="Atomics" value="@00007FEC26BC3F20"/>
    <property flags="+cEw_" name="BigInt" value="@00007FEC26BC0660"/>
    <property flags="+cEw_" name="BigInt64Array" value="@00007FEC26BC3100"/>
    <property flags="+cEw_" name="BigUint64Array" value="@00007FEC26BC3220"/>
    <property flags="+cEw_" name="Boolean" value="@00007FEC26BC01C0"/>
    <property flags="+cEw_" name="DataView" value="@00007FEC26BC28C0"/>
    <property flags="+cEw_" name="Date" value="@00007FEC26BC0D80"/>
    <property flags="+cEw_" name="Error" value="@00007FEC26BC1AE0"/>
    <property flags="+cEw_" name="EvalError" value="@00007FEC26BC1BE0"/>
    <property flags="+cEw_" name="FinalizationGroup" value="@00007FEC26BC5800"/>
    <property flags="+cEw_" name="Float32Array" value="@00007FEC26BC3340"/>
    <property flags="+cEw_" name="Float64Array" value="@00007FEC26BC3460"/>
    <property flags="+cEw_" name="Int16Array" value="@00007FEC26BC36A0"/>
    <property flags="+cEw_" name="Int32Array" value="@00007FEC26BC37C0"/>
    <property flags="+cEw_" name="Int8Array" value="@00007FEC26BC3580"/>
    <property flags="+cEw_" name="JSON" value="@00007FEC26BC2180"/>
    <property flags="+cEw_" name="Map" value="@00007FEC26BC4D60"/>
    <property flags="+cEw_" name="Math" value="@00007FEC26BC0E80"/>
    <property flags="+cEw_" name="Number" value="@00007FEC26BC0360"/>
    <property flags="+cEw_" name="Object" value="@00007FEC26BBE940"/>
    <property flags="+cEw_" name="Promise" value="@00007FEC26BC41A0"/>
    <property flags="+cEw_" name="Proxy" value="@00007FEC26BC4900"/>
    <property flags="+cEw_" name="RangeError" value="@00007FEC26BC1CE0"/>
    <property flags="+cEw_" name="ReferenceError" value="@00007FEC26BC1DE0"/>
    <property flags="+cEw_" name="Reflect" value="@00007FEC26BC4980"/>
    <property flags="+cEw_" name="RegExp" value="@00007FEC26BC1880"/>
    <property flags="+cEw_" name="Set" value="@00007FEC26BC5180"/>
    <property flags="+cEw_" name="SharedArrayBuffer" value="@00007FEC26BC3E00"/>
    <property flags="+cEw_" name="String" value="@00007FEC26BBFFC0"/>
    <property flags="+cEw_" name="Symbol" value="@00007FEC26BC45E0"/>
    <property flags="+cEw_" name="SyntaxError" value="@00007FEC26BC1EE0"/>
    <property flags="+cEw_" name="TypeError" value="@00007FEC26BC1FE0"/>
    <property flags="+cEw_" name="TypedArray" value="@00007FEC26BC2F60"/>
    <property flags="+cEw_" name="URIError" value="@00007FEC26BC20E0"/>
    <property flags="+cEw_" name="Uint16Array" value="@00007FEC26BC3A00"/>
    <property flags="+cEw_" name="Uint32Array" value="@00007FEC26BC3B20"/>
    <property flags="+cEw_" name="Uint8Array" value="@00007FEC26BC38E0"/>
    <property flags="+cEw_" name="Uint8ClampedArray" value="@00007FEC26BC3C40"/>
    <property flags="+cEw_" name="WeakMap" value="@00007FEC26BC5480"/>
    <property flags="+cEw_" name="WeakRef" value="@00007FEC26BC56C0"/>
    <property flags="+cEw_" name="WeakSet" value="@00007FEC26BC55C0"/>
    <property flags="+cEw_" name="decodeURI" value="@00007FEC26BBE360"/>
    <property flags="+cEw_" name="decodeURIComponent" value="@00007FEC26BBE3A0"/>
    <property flags="+cEw_" name="encodeURI" value="@00007FEC26BBE3E0"/>
    <property flags="+cEw_" name="encodeURIComponent" value="@00007FEC26BBE420"/>
    <property flags="+cEw_" name="escape" value="@00007FEC26BBE460"/>
    <property flags="+cEw_" name="isFinite" value="@00007FEC26BBE240"/>
    <property flags="+cEw_" name="isNaN" value="@00007FEC26BBE2A0"/>
    <property flags="+cEw_" name="parseFloat" value="@00007FEC26BBE2E0"/>
    <property flags="+cEw_" name="parseInt" value="@00007FEC26BBE320"/>
    <property flags="+cEw_" name="trace" value="@00007FEC26BBE520"/>
    <property flags="+cEw_" name="unescape" value="@00007FEC26BBE4E0"/>
    <property flags=" CEW_" name="Infinity" value="Infinity"/>
    <property flags=" CEW_" name="NaN" value="NaN"/>
    <property flags=" CEW_" name="undefined" value="undefined"/>
    <property flags="+cEw_" name="Compartment" value="@00007FEC26BC5AC0"/>
    <property flags="+cEw_" name="Function" value="@00007FEC26BBED60"/>
    <property flags="+cEw_" name="eval" value="@00007FEC26BBE4A0"/>
    <property flags="+cEw_" name="global" value="@0000560650FCE210"/>
    <property flags="+cEw_" name="globalThis" value="@0000560650FCE210"/>
  </global>
  <grammar>
    <node flags="+CEWM" name="/Resource.xsb" value="@00007FEC26BC67E0"/>
    <node flags="+CEWM" name="/instrumentation.xsb" value="@00007FEC26BC65C0"/>
    <node flags="+CEWM" name="/mc/config.xsb" value="@00007FEC26BC6640"/>
  </grammar>
</xsbug>

I think value="@00007FEC26BBF8C0" refers to an actual address in memory. I'll continue to poke around...

I did notice a call to fxEchoProperty was guarded by if (key->value.key.string[0] != '#'); so the debugger seems to filter private class fields.

@phoddie

phoddie commented Feb 20, 2020

The communication between xsbug and the host does provide a great deal of information about the running machine, as you would expect. There's no attempt, however, to make that comprehensive. For example, the contents of ArrayBuffer and TypedArray are not displayed, maps and sets are not inspectable, etc. That can be changed, of course, but we have to be sensitive about code size as the xsDebug.c code is installed on the microcontroller, which often has very constrained flash memory (1 MB is not uncommon, and less happens). You could have more debugging capabilities as a compile time option, of course. That said, I'm not sure this avenue is ultimately easier, just different.

@dckc
Member

dckc commented Feb 20, 2020

Yes, the debugger's approach to walking the machine state is largely analogous to the linker's. fxPrintSlot is pretty similar to fxEchoProperty. And the debugger's approach is, as you note, incomplete, unlike the linker.

But for somebody learning his way around the elephant, it's nice to have views from more than one angle.

@dckc
Member

dckc commented Feb 22, 2020

ugh! this is driving me bonkers. @warner @michaelfig @jfparadis help?! Here's hoping we can screen-share and you can show me what sort of insanity I'm suffering from.

some slot->kind is getting reset from XS_ARRAY_BUFFER_KIND (17) to 0 (undefined) when I call xsSetArrayBufferLength. I stepped thru with gdb and I can't find where things go wrong.

In the gist below, 9b2c6b8 seems to work fine, but when I try to snapshot a larger string (c37acf0), things go haywire. The gist includes all my debug prints in the README in both versions.

https://gist.github.com/dckc/52e43c8238e6202a0b90e157f0d74ae8

@phoddie

phoddie commented Feb 23, 2020

Is it possible that slot is being garbage collected? xsSetArrayBufferLength allocates memory so it could trigger GC.

@dckc
Member

dckc commented Feb 23, 2020

Thank you!

I switched from using a local C stack variable for my ArrayBuffer to xsVar(0) and that slot->kind no longer gets clobbered.

@dckc
Member

dckc commented Feb 24, 2020

I'm working on serializing an array of two strings. It ends up tracing into prototypes and such, grabbing large parts of the runtime... lots of motivation for the exits arg.

Meanwhile, I had to deal with circular structures; I think I have a solution.

But I'm struggling to get my decoder (which is in .js for now) in sync with the encoder. Details, details...

  • 62a09c8 toward snapshot of complex objects (WIP)

p.s.

  • df184d5 found Array.prototype when walking heap

@dckc
Member

dckc commented Feb 26, 2020

A basic array of scalars works now. And cycles.

I think I got exits working. It involves checking references against instances... which I think is kosher...
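For readers following along, here is a rough sketch of the idea in plain JavaScript rather than XS C code. It assumes a hypothetical wire format (a node table with `scalar`, `ref`, and `exit` entries) and a hypothetical `sketchSnapshot` helper; it is not the encoder from the PR. Cycles are handled by giving each object an id on first visit, and exits are detected by identity comparison against the `exits` list, so the walk stops there instead of pulling in the runtime.

```javascript
// Hypothetical encoder: walks own enumerable properties only and supports
// only JSON-compatible scalars. Not the format used by the actual PR.
function sketchSnapshot(root, exits = []) {
  const ids = new Map(); // object -> index in nodes (handles cycles and sharing)
  const nodes = [];
  const isReference = (v) =>
    (typeof v === 'object' && v !== null) || typeof v === 'function';

  const encodeValue = (value) => {
    if (!isReference(value)) {
      return { scalar: value };
    }
    const exitIndex = exits.indexOf(value); // compared by identity
    if (exitIndex >= 0) {
      return { exit: exitIndex }; // stop here; do not trace into the exit
    }
    if (typeof value === 'function') {
      throw new Error('functions must be listed in exits in this sketch');
    }
    if (ids.has(value)) {
      return { ref: ids.get(value) }; // already visited: cycle or shared object
    }
    const id = nodes.length;
    ids.set(value, id); // register before recursing so cycles resolve
    const node = { kind: Array.isArray(value) ? 'array' : 'object', props: {} };
    nodes.push(node);
    for (const key of Object.keys(value)) {
      node.props[key] = encodeValue(value[key]);
    }
    return { ref: id };
  };

  return JSON.stringify({ root: encodeValue(root), nodes });
}

const sharedRuntime = { name: 'runtime' };
const graphRoot = { label: 'a', list: [1, 2], runtime: sharedRuntime };
graphRoot.self = graphRoot; // a cycle
const encodedGraph = sketchSnapshot(graphRoot, [sharedRuntime]);
```

In this example the output has two nodes (the root and its list); `self` becomes a reference back to node 0 and `runtime` becomes exit 0.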

I started a PR, but there's some syncing to do on the base branch: agoric-labs/moddable#2

@dckc
Member

dckc commented Mar 1, 2020

@warner what's a good way to get a small but realistic test case for vat state?

p.s. got some clues; short version: https://github.com/Agoric/agoric-sdk/blob/master/packages/SwingSet/src/kernel/vatManager.js#L272

I think data = magicSnapshot(start, [exits]) is mostly working, though I used a class constructor because that seems to be the most straightforward way to call C from JS in xs.

        const s1 = new Snapshot();
        const rawbuf = s1.dump(root, exits);

@phoddie

phoddie commented Mar 2, 2020

Congratulations on the progress. You can have a native function that is not associated with a class:

function snapshot(root, exits) @ "xs_snapshot";

@dckc
Member

dckc commented Mar 2, 2020 via email

@dckc
Member

dckc commented Mar 2, 2020 via email

@phoddie

phoddie commented Mar 2, 2020

The examples for defining native functions in the XS in C document only show functions in a class. That is the most common case, but it works for stand-alone functions too.

If the documentation had included an example of a standalone native function, would that have helped here?

@dckc
Member

dckc commented Mar 2, 2020

Yes; that bit of documentation is one of the places that gave me the impression that a class was the only supported syntax.

@phoddie

phoddie commented Mar 2, 2020

Fair enough. The following text has been added to the next push of the XS in C document:

Standalone functions -- functions that are not part of a class -- can also be implemented in C. The @ syntax extension is used where the function body normally appears.

	function restart() @ "xs_restart";

The value of xsThis in the implementation of xs_restart matches the receiver, which is xsGlobal in the following invocation.

	restart();

It turns out there is at least one use of a standalone native function in the Moddable SDK, in the webConfigWifi example.

@dckc
Member

dckc commented Mar 3, 2020

next challenge: functions.

making slow progress...

@dckc
Member

dckc commented Mar 11, 2020

if a function comes from a module in ROM, re-connecting it to the ROM seems tricky. I think I'm going to assume that ROM functions are all in exits.

@dckc
Member

dckc commented Mar 11, 2020

got functions working in a few cases!

https://github.com/dckc/moddable/blob/0be772add377ee0e839e6f4efe84206574867dd5/examples/js/snapshot/main.js#L75-L90

I'm not quite sure how to put the toothpaste back in the tube when it comes to bytecode and closures, but I think I serialized the relevant pieces.

@dckc
Member

dckc commented Mar 20, 2020

Restoring data from snapshots works in several cases.

I sat down to work on functions but first I verified my assumption that I could restore scalars, arrays, and such. It was of course more work than I expected but not too bad. (no new C code...)

  • 01a11e9 rebuild after deserializing, up to cycles (functions TODO) …
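As an illustration of the rebuild step, the sketch below restores a graph in two passes. The wire format (a node table whose entries are `scalar`, `ref`, or `exit`) and the `sketchRestore` name are assumptions made for this example, not the format in the commit. Allocating every container first means a reference can always be resolved, which is what makes cycles straightforward.

```javascript
// Hypothetical decoder for an assumed node-table format; functions are out of scope.
function sketchRestore(text, exits = []) {
  const { root, nodes } = JSON.parse(text);

  // Pass 1: allocate an empty container per node so references have a target.
  const objects = nodes.map((node) => (node.kind === 'array' ? [] : {}));

  const decodeValue = (encoded) => {
    if ('ref' in encoded) return objects[encoded.ref];
    if ('exit' in encoded) return exits[encoded.exit]; // supplied by the caller
    return encoded.scalar;
  };

  // Pass 2: fill in properties now that every target object exists.
  nodes.forEach((node, id) => {
    for (const [key, encoded] of Object.entries(node.props)) {
      objects[id][key] = decodeValue(encoded);
    }
  });

  return decodeValue(root);
}

const runtimeExit = { name: 'runtime' };
const wireText = JSON.stringify({
  root: { ref: 0 },
  nodes: [
    {
      kind: 'object',
      props: {
        label: { scalar: 'a' },
        list: { ref: 1 },
        runtime: { exit: 0 },
        self: { ref: 0 }, // cycle back to the root
      },
    },
    { kind: 'array', props: { 0: { scalar: 1 }, 1: { scalar: 2 } } },
  ],
});
const rebuilt = sketchRestore(wireText, [runtimeExit]);
```

After the call, `rebuilt.self` is `rebuilt` itself and `rebuilt.runtime` is the very object passed in as an exit, not a copy.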

I invested in static type annotations on behalf of future-me and collaborators (since current-me could barely read the code written by two-weeks-ago-me).

How to restore functions from snapshots?

As to restoring functions, fxNewFunctionInstance from xsFunction.c looks promising; I suppose I need to do what it does...

@dckc
Member

dckc commented Apr 13, 2020

Array.fill ... dense

@dckc
Member

dckc commented May 17, 2020

I saw another project using https://github.com/protobuf-c/protobuf-c lately; not sure which...

@dckc
Member

dckc commented Aug 12, 2020

@warner and I were brainstorming about this and that and he mentioned that compiling vats to wasm could allow us to snapshot wasm instances.

That sounds possibly more straightforward than the spelunking thru the xs internals that I've done to date.

A little searching shows somebody else has tried this and had reasonably good luck with it: wasm-persist:

This allows you to hibernate a WebAssembly instance and later start it up again at the exact place you left off.

hmm...

This only snapshots globals, memory(s), tables.

I wonder if that would suffice.

Some research:
https://acmsocc.github.io/2019/slides/socc19-slides-s1-jeong.pdf

Proto-Faaslets are a way to restore the execution state of a function without rerunning the code that got it to that state.
https://github.com/lsds/faasm/blob/master/docs/proto_faaslets.md

hm. Maybe it's not obvious how to do this after all.

@erights
Member

erights commented Aug 12, 2020

Event loop concurrency --- the turn model --- saves us here. We only need to snapshot when the stack is empty. Fortunately, IIRC, the stack is the only WASM state that cannot be snapshotted.

@dckc
Member

dckc commented Oct 15, 2020

The condition that the SwingSet kernel needs to detect is: the promise queue is empty. If you can advise us how to do that in the XS world, that would be great, @phoddie .
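For comparison, on a host that provides `setImmediate` (Node.js, for example) the condition can be approximated as below. This is a sketch under that assumption only; XS does not necessarily offer the same hook, which is the reason for the question. The `waitUntilQuiescent` and `runCrank` names are used here purely for illustration.

```javascript
// Assumes a Node.js-like host: setImmediate callbacks run only after the
// promise (microtask) queue has drained.
function waitUntilQuiescent() {
  return new Promise((resolve) => setImmediate(resolve));
}

async function runCrank(deliver) {
  deliver(); // starts the turn; may queue any number of promise jobs
  await waitUntilQuiescent(); // resumes once those jobs have all run
}

const crankLog = [];
const crankDone = runCrank(() => {
  crankLog.push(1);
  Promise.resolve()
    .then(() => crankLog.push(2))
    .then(() => crankLog.push(3));
}).then(() => crankLog.push('end of crank'));
```

Once `crankDone` settles, `crankLog` holds `1, 2, 3` followed by `'end of crank'`, which shows the marker is appended only after the promise chain has finished.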

(originally raised as #45 )

@dckc
Member

dckc commented Oct 15, 2020

It looks like you did give some advice in #45 (comment) @phoddie

@phoddie

phoddie commented Oct 15, 2020

Sure. Will get back to you tomorrow.

@zarutian
Contributor

import { makePromiseKit } from "@agoric/promise-kit";
const alwaysActive = () => {
  const { promise, resolve } = makePromiseKit();
  promise.then(alwaysActive); // .then callbacks always run in next turn of the event loop.
  resolve(null);
}
alwaysActive();

which means that a vat might never become quiescent.

@michaelfig
Member

which means that a vat might never become quiescent.

Metering will kill this loop.

@warner warner mentioned this issue Oct 15, 2020
12 tasks
@dckc
Member

dckc commented Oct 20, 2020

@phoddie thanks for the setTimer update e8199b101298 ; with that, our "end of crank" test passes:

$ xsnap crankTest.js 
[1,2,3,4,5,6]

next challenge:

throw new Error(
  `Cannot pass non-frozen objects like ${val}. Use harden()`,
);

for more on that, see endojs/endo#104
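To illustrate what that check is asking for, here is a much simplified stand-in for `harden()`. It is an assumption-laden sketch, not the SES implementation: it freezes the graph reachable through own properties and does not deal with prototypes the way the real one does. `sketchHarden` and `checkPassable` are hypothetical names.

```javascript
// Simplified stand-in for harden(): freezes everything reachable through
// own properties. The real SES harden() also accounts for prototypes.
function sketchHarden(root) {
  const seen = new Set();
  const queue = [root];
  while (queue.length > 0) {
    const value = queue.pop();
    const isReference =
      (typeof value === 'object' && value !== null) ||
      typeof value === 'function';
    if (!isReference || seen.has(value)) {
      continue;
    }
    seen.add(value);
    Object.freeze(value);
    for (const key of Reflect.ownKeys(value)) {
      const desc = Object.getOwnPropertyDescriptor(value, key);
      if ('value' in desc) {
        queue.push(desc.value);
      } else {
        queue.push(desc.get, desc.set); // accessors are reachable too
      }
    }
  }
  return root;
}

// Hypothetical guard mirroring the error message quoted above.
function checkPassable(val) {
  if (!Object.isFrozen(val)) {
    throw new Error(`Cannot pass non-frozen objects like ${val}. Use harden()`);
  }
  return val;
}

const payment = { amount: { value: 3 } };
let rejectedBeforeHarden = false;
try {
  checkPassable(payment);
} catch (e) {
  rejectedBeforeHarden = true;
}
sketchHarden(payment);
const acceptedAfterHarden = checkPassable(payment) === payment;
```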

@dckc
Member

dckc commented Oct 26, 2020

Thanks to the moddable array fix (Moddable-OpenSource/moddable#479 (comment) 748fda93 ) I got further last Weds...

~/projects/agoric/agoric-sdk/packages/xs-vat-worker/build-xsnap$ make  run-xsnap >,out 2>&1

~/projects/agoric/agoric-sdk/packages/xs-vat-worker/build-xsnap$ ls -l ,out
-rw-rw-r-- 1 connolly connolly 92025 Oct 21 12:54 ,out

,out contains:

...
============== harden(a function called makeLocalAmountMath)
enqueue(val, undefined)
enqueue(val, unknown.length)
enqueue(val, unknown.name)
/home/connolly/projects/agoric/agoric-sdk/packages/xs-vat-worker/build-xsnap/output/bundle-functions.js:141: exception: ?.toString: this is no Function instance!
a prototype of something is not already in the fringeset (and .toString failed)
a prototype of something is not already in the fringeset (and .toString failed)
the prototype: /home/connolly/projects/agoric/agoric-sdk/packages/xs-vat-worker/build-xsnap/output/bundle-functions.js:150: exception: ?.toString: this is no Function instance!
// console.log might be missing in restrictive SES realms
?.toString: this is no Function instance
/home/connolly/projects/agoric/agoric-sdk/packages/xs-vat-worker/build-xsnap/output/bundle-functions.js:157: exception: throw!
/home/connolly/projects/agoric/agoric-sdk/packages/xs-vat-worker/build-xsnap/output/build.js:7: exception: throw!

@dckc
Member

dckc commented Oct 26, 2020

The problem above was with AsyncFunction, one of the "anonymous intrinsics". @warner helped me find getAnonymousIntrinsics() which let us get a little further. c206eaa
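For context, "anonymous" here means the intrinsic has no global binding, so a walk that starts from named globals never reaches it. The sketch below shows how a few such intrinsics can be obtained in standard JavaScript; `sketchAnonymousIntrinsics` is a hypothetical name, and the set collected by the real getAnonymousIntrinsics() may well differ.

```javascript
// These constructors exist in every conforming engine but are not globals;
// they can only be reached through an instance's prototype chain.
function sketchAnonymousIntrinsics() {
  return {
    AsyncFunction: Object.getPrototypeOf(async function () {}).constructor,
    GeneratorFunction: Object.getPrototypeOf(function* () {}).constructor,
    AsyncGeneratorFunction: Object.getPrototypeOf(async function* () {})
      .constructor,
  };
}

const anonymousIntrinsics = sketchAnonymousIntrinsics();
const hasGlobalAsyncFunction = 'AsyncFunction' in globalThis;
```

An async function's prototype is `AsyncFunction.prototype` rather than `Function.prototype`, which is consistent with the "this is no Function instance" failure in the log when a traversal meets it unprepared.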

current challenge: doProcess: vat.promise[p+5] fulfilledToPresence failed: Error: unknown promiseID 'p+5'

@dckc
Member

dckc commented Oct 28, 2020

With the fix to the snapshotting maps issue we get thru all 70 deliveries!

https://gist.github.com/dckc/2687721039b49c29558cbad9cb9e7baa

@warner
Member Author

warner commented Oct 31, 2020

@dckc awesome. Let's try it with v3 (vat-zoe) instead of v1 (vat-alice, use make list-vats to confirm which one is which), that should exercise even more code.

@dckc
Member

dckc commented Nov 1, 2020

When I try with v3 (vat-zoe) I get a failure in mustBeSameStructure when running deliver-8.js. body=["zcf"] vs body=...sourcemap....

xsbug shows:

Screenshot from 2020-11-01 15-57-17

And you're right: I hadn't pushed the changes we were using the other day: c206eaa

They're at https://github.com/dckc/agoric-sdk/tree/xs-snap-generator now. (I can't seem to push to Agoric/agoric-sdk).

The tweaks for v3 and a couple others I just made today are in cf6636a .

@dckc
Member

dckc commented Nov 1, 2020

The vatSourceBundle for the zoe vat also weighs in at 681K, so I had to tweak xsnap:

~/projects/xs-snap/xsnap/makefiles/lin$ git diff
diff --git a/xsnap/sources/xsnap.c b/xsnap/sources/xsnap.c
index 81e0f4f..dab5f68 100644
--- a/xsnap/sources/xsnap.c
+++ b/xsnap/sources/xsnap.c
@@ -129,7 +129,7 @@ int main(int argc, char* argv[])
                4096*3,                         /* keyCount */
                1993,                           /* nameModulo */
                127,                            /* symbolModulo */
-               256 * 1024,                     /* parserBufferSize */
+               1024 * 1024,                    /* parserBufferSize */
                1993,                           /* parserTableModulo */
        };
        xsCreation* creation = &_creation;

@dckc
Member

dckc commented Nov 28, 2020

@phoddie

phoddie commented Nov 29, 2020

xsnap is not a runtime for the Moddable SDK, so it doesn't implement support for the Moddable SDK's manifests. JavaScript modules are loaded from source code into xsnap using the -m option.

@dckc
Member

dckc commented Nov 29, 2020 via email

@phoddie

phoddie commented Nov 29, 2020

I see. xsnap already has several JavaScript functions that are bound to native C implementations. fx_print is a good example:

  • It is implemented using the XS in C API.
  • It is added to the globals of the VM using the internal XS function fxNextHostFunctionProperty for convenience. You could also do this using public functions in the XS in C API.
  • It is added to the gxSnapshotCallbacks array to be able to rebind the function when the snapshot is reloaded.
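The rebinding step in the last point can be pictured with a plain JavaScript analogy. This is not the XS API: `hostCallbacks`, `snapshotGlobals`, and `restoreGlobals` are hypothetical names, and the assumption is that a snapshot cannot usefully store a native function itself, so it stores the function's position in an agreed table and looks it up again on reload.

```javascript
// Analogy only: a table of host functions that both the snapshot writer and
// the snapshot reader agree on, in the same order.
const hostCallbacks = [
  function print(...args) {
    return args.join(' ');
  },
  function now() {
    return 0;
  },
];

// Writing: record each host function by its index in the table.
function snapshotGlobals(globals) {
  const entries = Object.entries(globals).map(([name, fn]) => {
    const index = hostCallbacks.indexOf(fn);
    if (index < 0) {
      throw new Error(`${name} is not a registered host callback`);
    }
    return [name, index];
  });
  return JSON.stringify(entries);
}

// Reading: rebind each name to whatever the table holds at that index now.
function restoreGlobals(text, callbacks) {
  return Object.fromEntries(
    JSON.parse(text).map(([name, index]) => [name, callbacks[index]]),
  );
}

const savedGlobals = snapshotGlobals({ print: hostCallbacks[0] });
const reboundGlobals = restoreGlobals(savedGlobals, hostCallbacks);
```

A function that is not in the table cannot be saved at all in this sketch, which mirrors why a new native function would need to be registered before snapshots can include it.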

@dckc
Member

dckc commented Jan 16, 2021

I think #2194 shows we've got this working.

@dckc dckc closed this as completed Jan 16, 2021
@dckc
Member

dckc commented Jan 18, 2021

After talking with @rowgraus, I'm re-opening to represent the relevant end-user feature here: folks running validator nodes can use snapshots to save time when resuming their node. This means not just having a low-level snapshot mechanism, but having the rest of SwingSet, contracts, etc. integrated with it.

It's likely that @rowgraus and/or @dtribble will create another issue to represent this "Epic" in order to integrate with some other tools.

@dckc
Member

dckc commented Feb 5, 2021

#2138 represents the end-user feature here.

@dckc dckc closed this as completed Feb 5, 2021

6 participants