
non-deterministic GC #1872

Open
2 of 12 tasks
warner opened this issue Oct 15, 2020 · 40 comments
Labels
enhancement New feature or request needs-design SwingSet package: SwingSet

Comments

@warner
Member

warner commented Oct 15, 2020

(this tries to capture @dtribble 's "radical GC idea" from our 14-oct-2020 kernel meeting)

Task list:

  • chore: provide GC tools (WeakRef/FinalizationRegistry) to makeLiveSlots #1924 provide WeakRef/FinalizationRegistry to liveslots, or no-op stubs when platform does not have them
  • fix: add vatDecRef to kernel, offer to liveslots #1926 add kernel decref function (only logs), provide vat-specific vatDecRef to liveslots
  • fix: add WeakRef tracking to liveslots #1927 change liveslots import-side table (slotToVal) to retain weakrefs to Presences. Finalizers call vatDecref with count (excluded from replay fidelity checks). Keep strong refs to non-dropped objects. Tests: run with node --expose-gc, have the test program use gc() to provoke a drop, examine vatDecref calls.
  • add import-to-vat refcounts to c-list entries, increment in vatTranslator during delivery. Expose in c.dump(). Testing: examine c.dump() counts
  • introduce between-crank GC action phase
  • during GC action phase, execute vatDecref decrements. testing: examine c.dump() counts
  • when c-list refcount drops to zero, delete c-list entry. Do not propagate further yet. Testing: examine c.dump() clist entries.
  • enhance existing kernel table refcounts to include objects too
  • when c-list entry is deleted, trigger kernel-wide examination of refcounts. If export is the only remaining reference, delete OT entry, delete exporting c-list entry, stop there. Testing: examine c.dump()
  • add dispatch.dropExport to liveslots, delete slotToVal and exports entries. Testing: not sure
  • send dispatch.dropExport when deleting the exporting c-list entry

What is the Problem Being Solved?

@FUDCo estimated about 23 new kernel objects being created for each cycle of a basic "exchange" benchmark. Each of these objects consumes RAM in the liveslots object tables, as well as disk in the kernel c-lists.

Many of these objects will be Purses or Payments or other "core objects", which will be moved to secondary storage (disk) via #455 and some kind of makeExternalStore-style constructor. However there are likely to be several which aren't so core (e.g. a Notifier), and aren't so likely to migrate: it would be a significant ergonomics burden if we asked programmers to make sure every single object they created and exported was put into an external store.

The remaining objects will cause memory pressure, perhaps less than 23 per cycle, but still a significant problem. We need a GC solution. Vats which import an object and then cease to reference it must signal the kernel (somehow) that the entry is no longer needed. The importing vat's c-list entry should be removed. When all clients of a kernel object (c-lists and resolution data in the kernel promise table) are gone, the exporting vat's c-list entry should be removed, and the exporting vat should be signalled that the exported object can be dropped. The exporting vat should remove the entry from its liveslots tables, allowing the original object to be dropped.

The big problem, which has blocked us from making much progress on this front, is determinism. JavaScript now has WeakRefs, which let us attach a finalizer to fire when an object is dropped, but it does not make any guarantees about when this finalizer runs. It could be immediately after the last strong reference is dropped, or a minute later, or a year later. In a chain-based swingset machine (i.e. cosmic-swingset), if the GC behavior is observable, consensus among multiple validators (with perhaps different memory pressures) could not be reached.

Previous Plan (which probably wouldn't work)

The setup: suppose vat A is a client of some object in vat B. So vat A has a Presence that wraps some imported object ID (o-4), vat A's c-list maps o-4 to a kernel object ko5, vat B's c-list maps ko5 to an exported o+6, and vat B's liveslots tables map o+6 to e.g. an exported Purse object. Vat A's liveslots tables manage the import by having slotToVal (a strong Map) map o-4 to the Presence, and valToSlot (a WeakMap) map the Presence back to o-4, and the initial problem is 1: we aren't using a WeakRef anywhere, and 2: the strength of slotToVal would prevent the Presence from being collected even once the vat drops it. On the exporting side, Vat B's liveslots table has slotToVal mapping o+6 to the original callable Purse object, and valToSlot mapping the purse back to the string o+6.

Previously, I was expecting a system in which liveslots uses WeakRef to detect Presences becoming unreferenced: we change slotToVal to use values which are new WeakRef(Presence), and add a finalizer/notifier (FinalizationRegistry) to sense when userspace has dropped it. At that point, liveslots removes the slotToVal entry (knowing the valToSlot entry is already gone), at which point it would send a syscall.drop('o-4') to the kernel. The kernel reacts to that by removing the vat A c-list entry for o-4, which then decrements the ko5 reference count. If/when all other importing vats (and any resolved promise data) cease referencing ko5, such that the only remaining reference is from the exporting vat (i.e. the kernel object table records vat B as the owner of ko5, and the vat B c-list is the only remaining reference), then the kernel deletes ko5 from the object table and delivers a dispatch.drop(o+6) to vat B. When this is delivered, the kernel deletes the vat B c-list entry, delivers the DROP, and then vat and kernel agree to "never speak of this ID again", just like they do for retired promise IDs.
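The import-side bookkeeping described above can be sketched roughly as follows (a minimal illustration with assumed names like `ImportTable` and `registerImport`; the real liveslots code is organized differently):

```javascript
// Sketch: import-side liveslots tables with WeakRef-based drop detection.
// The finalizer timing is non-deterministic, which is the whole problem
// this issue discusses.
class ImportTable {
  constructor(dropCallback) {
    this.slotToVal = new Map(); // objectID -> WeakRef(Presence)
    this.valToSlot = new WeakMap(); // Presence -> objectID
    // Fires at some engine-chosen time after the Presence becomes
    // unreachable; the callback would perform syscall.drop(objectID).
    this.registry = new FinalizationRegistry(objectID => {
      this.slotToVal.delete(objectID);
      dropCallback(objectID);
    });
  }
  registerImport(objectID, presence) {
    this.slotToVal.set(objectID, new WeakRef(presence));
    this.valToSlot.set(presence, objectID);
    this.registry.register(presence, objectID);
  }
  getPresence(objectID) {
    const ref = this.slotToVal.get(objectID);
    return ref && ref.deref(); // undefined once collected
  }
}
```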

If it weren't for the nondeterminism of the FinalizationRegistry, this would probably work. But the timing of finalization calls is far too uncertain for us to rely upon in a consensus machine. Even if we require all validators to run the same version of the same engine, the behavior will depend upon memory pressure and perhaps the past history of the engine (heap fragmentation, etc). We aren't even sure we could use this in a non-consensus machine, because simply restarting the solo node (replaying the transcript) might result in different finalizer calls on the second time through, breaking consistency with the previous transcript.

What's New

@dtribble 's question was: what if GC wasn't visible to consensus? By relaxing the definition of consensus, we would allow vat execution to vary somewhat between validators: the messages they send must still be identical, but their GC decisions don't.

In this scheme, we'd still have liveslots use a finalizer to sense the client dropping their last reference. Vat A would still notify the kernel about the DROP, but it wouldn't do it with a syscall, or at least the syscall it uses would not be part of the consensus state. Our transcripts record a list of deliveries, each with a list of syscalls made during the processing of that delivery. Vat replay means re-submitting each delivery, and comparing the syscalls made by the new vat against the ones recorded by the old one. In this case, we'd exclude syscall.drop from this comparison, allowing divergence between the previous run and the new one.

To be precise, we should distinguish between several different places where we compare the behavior of one execution trace against another, for different reasons:

Orthogonal Persistence

Orthogonal persistence is accomplished by starting from some initial state (possibly a #511 heap snapshot) and then replaying a transcript of previous deliveries. Each delivery includes the set of syscalls that were made in the first pass, along with the results of those syscalls that return data (device reads, secondary-storage reads). We don't intend to give vats the ability to behave differently on the second pass, but to detect bugs in the surrounding code earlier, we want to detect replay divergence as soon as possible, so we currently compare the syscalls and their arguments against the transcript, and raise a vat-fatal "anachrophobia error".

For the narrow purpose of replay, the transcript really only needs to retain the deliveries, and the return values of the syscalls. When we add early divergence detection to the list of requirements, it would suffice to include a hash of the syscall arguments. In both cases, we can truncate the transcript each time we take a heap snapshot. (Note that we may record the full transcript, with all data, for debugging purposes, in particular to rebuild a vat locally under a debugger, and/or with additional logging enabled. This record is excluded from consensus, is generally write-only, and is independent of the one used for orthogonal persistence.)

Validator Consensus

In a high-credibility machine (i.e. one embedded in a blockchain), the behavior of the composite machine should not depend upon the behavior of any small subset of the validators. Only decisions made identically by a suitable majority (typically 2/3rds) of the validator power should be recognized by clients as authentic behavior of the "chain" as a whole.

But not all decisions need to fall into this category. Validators do not expose all of the inner workings of the shared application. In Cosmos, only data that is stored into the Cosmos state vector needs to be identical across validator nodes (and only that data can be validated by clients: anything else a client might be told will not have the same credibility as something in the state vector).

Our current rule is that the swingset kernels (and the vats they host) can "think" whatever they want, but anything they "speak" must meet consensus. The swingset-on-cosmos kernel writes outbound messages into the cosmos state vector for delivery to other systems (either through IBC or our initial custom protocol). Different validators could conceivably have vats which behave in different ways, as long as the comms-layer externally-targeted messages they create are identical. In practice, however, if the vat internals diverge, the messages are likely to diverge as well.

Fast Validator Catch-up

(This might be known as "fast sync" or "smart sync" in some environments.)

When a new validator joins the chain, it needs to acquire a set of vats which have the same internal state as the existing validators, so that they evolve in the same way and the validator can produce blocks that the other validators will accept.

The slow-but-steady way to achieve this is to replay the entire chain from the beginning. The new node must find someone willing to feed it with a list of all blocks, starting with the genesis, and replay the transactions in each one. This causes the application state to evolve, one transaction at a time, producing new blocks. At each step, the node checks the consensus signatures and confirms that a threshold of validators has approved the block, and that the locally-generated block matches the one approved by the existing validators. A subset of these blocks will contain messages aimed at the swingset module, and processing these messages will cause the swingset kernel+vat state to evolve, following the same path (and generating the same outbound messages) as the swingset module in those earlier validators.

If the chain was mostly CPU bound the first time through, the new validator will probably be equally CPU bound. They have the advantage of not needing to wait for new blocks to be produced, but their only chance to catch up will be through the idle gaps in the earlier block history. As a result, new validators may take an extremely long time to reach the current state and become prepared to produce new blocks of their own.

The faster way to get caught up is to receive a pre-evolved state vector from some existing validator. If the data included the entire application state, and if the new node could rely upon the correctness of that data, it could simply drop the state vector into its own empty database and then pretend that it had just rebooted after being fully caught up. It would not have to replay anything, and a new validator could be brought online a few moments after receiving the state vector.

Those caveats point to two conditions that must be addressed. The first is completeness. Swingset does not keep all its state in the cosmos state vector; in fact, much of it is not kept on disk at all. The majority of the swingset state is in the RAM heap of the JavaScript engine, dispersed among all the vats. The rest is in the LMDB-based "KernelDB", on disk, which holds the c-lists and vat snapshots/transcripts. The RAM state can be regenerated, however, from the vat transcripts. So, given extra startup time, we can pretend that all swingset state is kept on disk. When catching up, we can take a faster path by starting from a vat snapshot, or by lazily loading vats (prioritizing the active ones, and deferring the idle ones). We can also parallelize vat reloads in multiple CPU threads, because each vat runs independently.

The second condition is the new validator's ability to rely upon the state vector it has received. This vector must be signed by an appropriate majority of the validators, otherwise a new validator could be tricked by a minority subset, violating the credibility properties. It is not sufficient for the other validators to merely examine and approve the output messages of the new validator: this demonstrates that the current behavior is correct, but does not protect against future divergence. In fact, if the malicious state provider successfully convinces the new validator to accept a state bundle that includes an #1691-style "sleeper agent" (which doesn't activate until much later), and the new validator is, in turn, relied upon by other new validators to get that state bundle, a patient attacker could manage to eventually convince all validators to believe a state vector that does not match the original, and obtain arbitrary control over the entire chain.

For this reason, whatever fast-sync state vector a new validator might use must be subject to the same consensus rules as the blocks used to derive that state. In practice, this means all validators must periodically produce the vector themselves, and include its hash in their blocks. For these to be compared, all validators must produce the same vector (anything that goes into this vector is "consensus-critical").

To use this for swingset, the vat snapshot, transcripts, kernel c-lists, and kernel object/promise tables must all be identical among all validators.

Description of the Design

So, the proposal from today's meeting is:

  • the liveslots slotToVal table for imported objects will map object IDs to WeakRefs (of Presences) rather than mapping to the Presence directly
    • (I think exported objects must be retained by strong references, which suggests we might need two tables)
  • a FinalizationRegistry is used to sense when the target Presence has been GCed
  • the registry's cleanup callback will:
    • remove the slotToVal mapping
    • perform a syscall.drop(objectID) call, to inform the kernel that the entry is no longer needed
  • the kernel reacts to syscall.drop by removing the entry from the vat's c-list, and (immediately? eventually?) performing a GC sweep through the c-lists
  • at some point, this may result in the kernel object ID becoming unreferenced by everyone except the exporting vat
    • the kernel will immediately deliver a dispatch.drop(objectID) to the target vat (bypassing the run-queue, so as to not give the target vat an opportunity to emit a new reference to the about-to-be-deleted export, which would need to cancel the drop)
    • the kernel will translate the kernel object ID into the vat ID, delete the entry from the c-list, then deliver the drop(vatObjectID)
  • the receiving vat's liveslots will react to the dispatch.drop by deleting the slotToVal table entry, which will drop liveslots' reference to the original exported object
    • the valToSlot WeakMap may still include a (weak) reference, if other code within the vat retains it
    • therefore, if the vat ever re-exports the same object, it will re-use its original object ID, un-dropping the ID. As long as both sides of each vat-kernel boundary agree on the drop and export events, this shouldn't cause a problem, but it might be better to somehow invalidate the old object ID and allocate a new one (which would probably involve changing valToSlot to point at some cell instead, retaining a strong slotToCell mapping, and storing an isValid flag in the cell next to the objectID).
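The "cell" variant mentioned in the last bullet might look roughly like this (a sketch with assumed names like `makeExportTables`; this is the invalidate-and-reallocate option, not the reuse option):

```javascript
// Sketch: valToSlot points at a shared cell, so a later drop can
// invalidate the old object ID and force a fresh ID on re-export.
function makeExportTables(allocateID) {
  const valToCell = new WeakMap(); // exported object -> cell
  const slotToCell = new Map(); // objectID -> cell (strong, keeps cell alive)
  function exportObject(obj) {
    let cell = valToCell.get(obj);
    if (!cell || !cell.isValid) {
      // first export, or re-export after a drop: allocate a new ID
      cell = { objectID: allocateID(), isValid: true };
      valToCell.set(obj, cell);
      slotToCell.set(cell.objectID, cell);
    }
    return cell.objectID;
  }
  function dropExport(objectID) {
    const cell = slotToCell.get(objectID);
    if (cell) {
      cell.isValid = false; // "don't use me, allocate a new ID instead"
      slotToCell.delete(objectID);
    }
  }
  return { exportObject, dropExport };
}
```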

Consensus Consequences

For orthogonal persistence purposes, we must not include the drop syscalls in the replay comparison check: a replay delivery is considered successful if it makes exactly the same non-drop syscalls that the transcript recorded. This means filtering out any drop syscall before checking the transcript.
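The filtering step might be as simple as this (a sketch; syscall record shapes are assumed, and the real comparison would use hashes rather than `JSON.stringify`):

```javascript
// Sketch: replay fidelity check that ignores GC-driven `drop` syscalls,
// so a replay is successful if the non-drop syscalls match exactly.
function filterConsensusSyscalls(syscalls) {
  return syscalls.filter(s => s.type !== 'drop');
}

function replayMatches(recorded, replayed) {
  const a = filterConsensusSyscalls(recorded);
  const b = filterConsensusSyscalls(replayed);
  return (
    a.length === b.length &&
    a.every((s, i) => JSON.stringify(s) === JSON.stringify(b[i]))
  );
}
```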

Should transcripts include the dispatch.drop deliveries, such that replayed vats are told to drop exports in the same way their predecessor was told? Certainly they need to be told eventually, otherwise they'll use more RAM than their predecessor did. But we can tell them earlier or later without problems. One suggestion was to have replay watch the syscalls and build up a list of exported object IDs, then compare it against the current c-list, and immediately submit a bunch of drop deliveries for the missing ones (at the end of replay, before any new messages are delivered). When we move to snapshot+transcript, we might track "dropped exports" in a special set, so that the current (minimized) vat state can always be reconstructed by 1: loading the heap snapshot, 2: replaying the truncated (post-snapshot) transcript, and then 3: delivering drop for everything in the set. We can truncate the drop set when we make a new snapshot.

For validator consensus purposes, where we only put externally-targeted messages into the Cosmos state vector, it doesn't matter that some validators have a drop where others do not. As long as we aren't putting hashes of kernel (c-list) state or vat heap snapshots or drop-bearing transcripts into that state vector, validator behavior can safely diverge along the narrow drop/not-drop axis.

For fast-sync purposes, we have a problem. Heap snapshots where GC has happened (especially where dispatch.drop has been delivered) will be completely different from those of vats where it has not. Likewise c-lists will be missing entries where drop took place. These states will differ between validators. As a result, the data we use for fast-sync cannot include heap snapshots if drop is in use. We could work around this by only fast-syncing full from-the-beginning transcripts (which we would have to retain for this purpose, since normally we would truncate the transcript each time we took a heap snapshot).

The varying c-lists, however, are harder to deal with. The full swingset state includes the kernel object/promise tables, the c-lists for each vat, and the snapshot+transcript for each vat. Validators can't agree on the kernel state if the c-lists will depend upon whether a given vat did GC and emitted a drop or not.

@dtribble pointed out one workaround would be to abandon vat-at-a-time replay and instead/additionally record a totally-ordered full-kernel transcript (basically a list of every item added to the run-queue). Replaying this list would regenerate the c-lists as well as the internal vat state, which would remove the need to publish (and agree upon) the c-list contents. The cost would be:

  • validators must store information for fast-sync that would normally be discarded
  • fast-sync could not use snapshots to avoid playing an entire vat's transcript
  • fast-sync could not replay multiple vats in parallel
  • fast-sync could not lazy-load vats
  • fast-sync would take nearly as long to complete as the original runtime took to generate the kernel history: cosmos operations would be short-circuited, but every swingset operation must be repeated

If we went that direction, we could have a system in which "fast-sync" uses one set of data and "reboot catch-up" uses a different set. As @erights pointed out, nodes can safely rely upon their earlier history more so than upon data provided by outsiders. So the slow processing of "fast sync" might be a necessary compromise to allow GC to proceed.

Additional Considerations

Vats should not get access to the WeakRef or FinalizationRegistry constructors, so they should not be able to sense objects being GCed. No user-level vat code should run in reaction to a dispatch.drop: only liveslots will have control.

How should this interact with secondary-storage "virtual objects" (#455)? Ideally, when the exporting vat receives the drop, it should delete the relevant entries in secondary storage (which might cascade into deleting other virtual objects, somehow). However, even if the secondary-storage syscalls are not consensus-critical, the DB entries which back them are just as consensus-relevant as the vat's c-lists and transcript. This suggests fast-sync cannot include secondary storage either, since it may be missing contents that were released by GC, pushing further into the "replay a full kernel transcript" approach.

Comms vat: when GC is distributed across swingset machines, the drop message becomes externally visible (and thus subject to much stricter consensus rules). This is the point where we need deterministic behavior. I fear this is "purple box" -level problem (thesis-level complexity).

Security Considerations

We have to be very careful to not allow user-level code to observe any non-determinism exposed by the GC mechanism, through a combination of:

  • vat code will not have WeakRef or FinalizationRegistry in their globals (SES should filter these out of non-start Compartments because of their nondeterministic nature, just like Math.random and Date.now), so they cannot sense GC directly
  • liveslots receives drop messages and acts upon them, but must not allow that to change the behavior that vat code can observe

The consequences of a leak would be that vat code (probably deliberately written to exploit the leak) could cause consensus failure: some validators would be slashed for behaving differently than others. If the divergence is severe enough, the chain could halt entirely if too few validators were able to agree upon new blocks.

The fast-sync state must be validated correctly. If an attacker who supplies alleged state data can get a new validator to accept malicious state, they can control the output of that validator, and eventually (with enough patience and time for the bogus data to spread) that of the entire chain.

@warner warner added enhancement New feature or request SwingSet package: SwingSet labels Oct 15, 2020
@warner
Member Author

warner commented Oct 15, 2020

How should the kernel keep track of which imports are still being used?

For each vat, the deterministic ("userspace") computation has a specific set of live imports: the Presence provided to userspace is either reachable from userspace or not. This reachability is the same across all executions of the same vat during that same delivery, regardless of which validator it is on, or how many times the vat has been reloaded from snapshot and replayed from transcript. Let's call this the "actual liveness".

The liveslots layer, using its WeakRef and FinalizationRegistry, senses a slightly different property, which we'll call "apparent liveness". An import which is actually dead may still be apparently alive, because the finalizer has not yet fired.

For each imported object (o-NN), the kernel can track when that object was provided to the vat: this marks the start of the liveness window. The kernel will also track when liveslots emits a drop message, which marks the end of the apparent liveness window (the actual liveness window is not visible to either liveslots or the kernel, but it is bounded by the apparent window).

If, on one run, the apparent window begins at delivery 4 (the kernel includes o-NN in a delivery argument for the first time), and ends at delivery 8 (when the vat does a drop syscall), the kernel knows that 1: the actual liveness window ended somewhere in delivery 4 through 8 (inclusive), and 2: the object is actually dead in delivery 9 and later. It is vat-fatal for the vat to reference the object ID in syscalls made during delivery 9 or later, unless the kernel re-introduces the ID in some new delivery. Let's pretend that we know the actual window ended during delivery 6, and it took two more deliveries before the GC infrastructure notified the finalizer.

Now, suppose the vat is reloaded from transcript. On this second run, the drop message appears during delivery 7. If the kernel remembered the previous run, it would know that the actual liveness window is still bounded by the earlier of the edges. If, in a third run, the drop appears during delivery 6, that narrows the window. If the kernel has ever observed a run where the drop appeared during delivery 6, then it could be vat-fatal to mention the ID during delivery 7, even if the new run hasn't yet emitted a drop. Except, ugh, vat-fatal errors are made visible to other vats, so we must be careful to not use this extra knowledge in a way that could reveal it to vats.

What I'm trying to sort out is what exactly the kernel needs to track. I'm thinking that the kernel is allowed to delete the import from the c-list as soon as it sees the drop. The c-list is not versioned: we only retain the latest copy. When the kernel reloads the vat, if the drop is emitted earlier this time around, that's fine, because the syscalls are being ignored during replay anyways. If it is emitted later, the kernel will observe a drop for an object ID that isn't currently in the c-list (because it was deleted during the previous run), so we must tolerate that case (silently ignore).
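The tolerate-the-missing-entry rule in that paragraph can be sketched as (assumed names; the real kernel's c-list API differs):

```javascript
// Sketch: kernel-side handling of a drop syscall. A replayed vat may
// emit drop earlier or later than the previous run did, so a drop for
// an already-deleted c-list entry must be silently ignored.
function handleDropSyscall(clist, vatObjectID) {
  if (!clist.has(vatObjectID)) {
    return undefined; // deleted during a previous run; tolerate silently
  }
  const kernelObjectID = clist.get(vatObjectID);
  clist.delete(vatObjectID); // delete as soon as we see the drop
  return kernelObjectID; // caller decrements the kernel refcount
}
```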

On the other edge of the window, if the kernel sees a syscall with an object ID that isn't in the c-list, that definitely means the vat is making an illegal access. But I think there are three cases:

  • the vat is accessing an object that it was never given, in which case a vat-fatal error is totally appropriate (it is a deterministic function of the vat's behavior, without the influence of GC timing which the vat is not supposed to be able to sense)
  • it once had access to the object, but it gave up that access by emitting a drop earlier during this run. This is clearly illegal behavior, but depends upon GC timing.
  • it once had access to this object, and the kernel has not yet seen a drop in this run, but on some previous run the vat did emit a drop, so that previous kernel removed the c-list entry. This is illegal behavior, but the kernel only knows that because it remembers GC activity from the previous run. The new kernel has contraband information from a previous run, giving it a narrower bound on the actual liveness window.

How do we keep the kernel from accidentally revealing its knowledge of GC activity (to other vats) when a vat references a previously-dropped import? If the kernel deletes the c-list entry when it receives the drop, it won't have enough information to handle a later syscall which references that ID. So it must either kill the vat, or retain the supposedly-dropped c-list entry so it can pretend that the object is still alive.

What exactly is the pathway by which the kernel might accidentally reveal its forbidden GC knowledge? If the kernel kills the vat when it references a previously-dropped object ID, that will have two effects: a message is sent to the vatAdmin (to notify the vat's parent, by resolving the done promise), and all subsequent messages sent to the dead vat's objects will reject (with a VatTerminatedError). These messages get pushed onto the run-queue, which is part of the saved kernel state (for local reload), but not part of the observable consensus state (validators agree on externally-targeted messages, not the internals).

However, once those termination messages are delivered, they may provoke externally-targeted messages, which are part of the validator consensus. So all validators must choose the same time to kill the vat, give or take some wiggle room: I think it depends upon the order of other messages currently in flight. If one validator's vatAdmin were notified one crank later than the other validators (still within the same block), but there were no external messages happening about the same time, then an outside observer might not observe a difference in the external message ordering, and might not notice the divergence. However we can't depend upon that.

So this is making me think the kernels really do need to agree upon when the vat is dead: the vatAdmin message must be delivered on the same crank across all validators. That limits our options. The kernel's drop-gleaned GC information is not admissible evidence for the crime of illegal syscalls. The kernel must pretend that it didn't see the drop, which means it needs to keep the c-list entry around (so it can route the illegal message).

That points to a new "live" column in the kernel object table. drop clears this flag, and triggers the delivery of drop to the exporting vat, but isn't allowed to influence the illegal-syscall check.
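The proposed "live" column might be sketched like so (assumed names; the point is that the legality check deliberately ignores the flag):

```javascript
// Sketch: kernel object table with a "live" column. A drop clears the
// flag (and may trigger dropExport to the owner), but the illegal-
// syscall check ignores it, so GC timing cannot influence when a vat
// is killed. The drop-gleaned knowledge is "not admissible evidence".
function makeKernelObjectTable() {
  const table = new Map(); // kernelObjectID -> { owner, live }
  return {
    addExport(koid, owner) {
      table.set(koid, { owner, live: true });
    },
    markDropped(koid) {
      const entry = table.get(koid);
      if (entry) entry.live = false; // GC knowledge, kept private
    },
    // Any entry still in the table is a legal syscall target, live or
    // not: the kernel pretends it never saw the drop.
    isLegalTarget(koid) {
      return table.has(koid);
    },
  };
}
```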

I'm still pondering what this means for the exporting vat.

@warner
Member Author

warner commented Oct 16, 2020

@FUDCo and I discussed this some more. Both the "when should a vat be killed for an illegal syscall" question and the "when should the comms vat send a drop" question are troubling: they both represent consensus-visible behavior that depends upon GC timing that won't be consistent.

One wild idea we explored was a "drop oracle". The idea would be that the kernel somehow reacts to a drop syscall by telling an external program (maybe something in cosmos-sdk, but not part of consensus) that this particular validator believes o+NN has been dropped. This program then creates a signed transaction message to that effect, and submits it back into the chain. As an externally-introduced transaction, all validators will see it. The message is routed to a voting/governance cosmos-sdk module that knows about the validator staking/power, and that module accumulates drop messages until it sees a sufficient threshold of votes reached. At that point, the voting module submits a "drop the following object IDs" message to swingset, which then does the decref and maybe tells the exporting vat to drop if it was the last reference.
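A purely illustrative sketch of the vote accumulator in that scheme (all names assumed; this glosses over transaction plumbing, equivocation detection, and incentives):

```javascript
// Sketch: "drop oracle" accumulator. Collect per-validator drop votes
// for an objectID and report when voting power strictly past a
// threshold (default 2/3) agrees, at which point the voting module
// would submit a "drop the following object IDs" message to swingset.
function makeDropOracle(powerByValidator, thresholdNum = 2, thresholdDen = 3) {
  const totalPower = [...powerByValidator.values()].reduce((a, b) => a + b, 0);
  const votes = new Map(); // objectID -> Set of validator names
  return function recordVote(validator, objectID) {
    if (!votes.has(objectID)) votes.set(objectID, new Set());
    votes.get(objectID).add(validator);
    let power = 0;
    for (const v of votes.get(objectID)) {
      power += powerByValidator.get(v) || 0;
    }
    // true => threshold reached, safe to decref in consensus
    return power * thresholdDen > totalPower * thresholdNum;
  };
}
```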

Vat termination for illegal syscalls could maybe be done the same way. Effectively all nodes vote a reference (or an entire vat) off the island. Death by committee.

The downside is, of course, that we're kind of inventing our own consensus layer, and that was on our list of things we shouldn't do. Drafting off the existing voting module probably reduces the hazards, but it still raises the question of whether the incentives are correct. If one validator maliciously claims that they've seen a drop from some vat, when nobody else has (and in fact their vats are still talking about that import in syscalls), what are the consequences? How would anyone else even tell that they're doing something malicious? Maybe the validator must periodically vote with a list of all the object IDs that they claim to have seen dropped, and equivocation can be grounds for slashing. (Does the existing cosmos-sdk voting/governance module have provisions for slashing to discourage vote equivocation? Or are votes one-sided and the only proof needed to carry the measure is a plurality of "yes" votes, because you can't even cast a "no" vote?)

@dtribble (Member) commented:

The only viable answer is that it cannot make an illegal syscall.

If vat A exports an object O under kernel key K1, the incoming drop means there are no references to O from outside A. That was true at the end of the actual liveness window, above. So no external event can cause a use of the kernel's key for A. But there may be references inside A. So the only problem is if A exports O again. Does it reuse K1 or does it get another key?

A simple fallback answer is that it uses K1. When the drop arrives, liveslots doesn't drop the reference to O, it moves it to a weak pointer. Thus, as long as O is retained in A, if it's exported, it will be exported with the same K1. Only when O is dropped in A as well will the kernel be told it will never see K1 again.
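A minimal sketch of this fallback, with invented names (makeExportTable, strongExports, weakExports) rather than the real liveslots tables: on a drop, the strong reference is demoted to a WeakRef, and a later re-export of the same (still-live) object reuses the original key.

```javascript
// Hypothetical sketch only: when the kernel delivers a drop for an export,
// liveslots demotes its strong reference to a WeakRef. If the vat later
// exports the same object again, the original kernel key (K1) is reused.
function makeExportTable() {
  const valToSlot = new WeakMap(); // object -> vref, survives demotion
  const strongExports = new Map(); // vref -> object (kernel still references it)
  const weakExports = new Map();   // vref -> WeakRef (kernel dropped it)
  let nextId = 1;

  function exportObject(obj) {
    let vref = valToSlot.get(obj);
    if (vref === undefined) {
      vref = `o+${nextId++}`;
      valToSlot.set(obj, vref);
    } else if (weakExports.has(vref)) {
      // re-export after a drop: revive the entry under the same vref
      weakExports.delete(vref);
    }
    strongExports.set(vref, obj);
    return vref;
  }

  function dropExport(vref) {
    // kernel says: no external references remain; keep only a weak edge
    const obj = strongExports.get(vref);
    strongExports.delete(vref);
    if (obj !== undefined) {
      weakExports.set(vref, new WeakRef(obj));
    }
  }

  return { exportObject, dropExport, strongExports, weakExports };
}
```

Only when the object is also collected inside the vat would the WeakRef go dead and the kernel be told K1 will never reappear.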

I think we can do better than that, but the fallback is implementable. For example, resolved promises will never be exported again. It's really only presences for which this could be an issue.

@warner (Member, Author) commented Oct 16, 2020

That sounds like a decent approach for the exporting side. I don't think it matters too much whether we reuse the exported ID or allocate it a new one. Although we need more bookkeeping if we want to allocate a new one (the valToSlot WeakMap might still remember the old ID, but in this case we're not supposed to use it, so we'd need to change that map to point to a record that is also pointed to by a strong map, and contains a flag that says "don't use me, allocate a new ID instead"). OTOH if we allow the ID to be reused, that might complicate some kernel-side data structure that keeps track of which IDs are alive or not.. haven't thought about that structure yet. My hunch is that reusing IDs will be simpler overall.

My concern has mostly focused on the importing side. But I think it's safe to say that illegal syscalls can only happen if the liveslots layer is buggy. If liveslots is functioning correctly, then even a malicious userspace should not be able to trigger an illegal syscall. So it's a question of how we define the TCB. I don't want to give liveslots the authority to break the integrity of any other vat than its own, but maybe our fallback position is that we do give liveslots the authority to break the swingset as a whole (by making syscalls whose legality depends upon GC-sensitive nondeterminism), and rely upon our implementation to not exercise that authority.

I don't see how that helps us with cross-swingset (comms-vat) drops, alas.

@warner (Member, Author) commented Oct 16, 2020

Hm, the TCB argument is stronger than I realized: of course we're giving liveslots access to nondeterminism, because that's the only way to get GC information about what userspace is doing. So of course we're forced to rely upon liveslots not misusing that authority. Which means I think we can stop worrying about a buggy liveslots causing divergence, because we don't have any other choice.

So that leaves comms-vat drops. Ugh.

@warner (Member, Author) commented Oct 16, 2020

Oh, and we probably can have the comms-vat drop references from a solo machine, which might mitigate most of the problem: how frequently do our solo-side clients provide objects to the chain, rather than the other way around?

@warner (Member, Author) commented Oct 16, 2020

On the export side, I think we should go with arrival-order nondeterminism.

Each drop syscall will check the c-list of the no-longer-referencing vat. If present, it will remove that entry and add the kernel object ID (koNN) to the "maybe-free" list (which lives in the persistent DB). If not present, the syscall is ignored (this will happen when we did a suspend/reload since the import was visibly dropped, and the new vat's finalizer happens to run later than the previous one did). We want to wrap up our interaction with the previous vat before we do anything involving other vats, so at the end of the crank, we commit the DB as usual. In addition to the normal crank buffer DB changes, for each non-redundant drop, we expect to see one addition to the maybe-free list, and one removal from the c-list (of the dropped import).

Then, before moving on to the next item on the run-queue, we pop each item from the maybe-free list and search for remaining references. These references might come from other vat c-lists (worst case it could scan them all, best case we maintain a reference count), or from the resolution/rejection data of unretired promise table entries. If any references still exist, the kernel does nothing. If there are none, the kernel reads the owner (exporting) vat of the object, deletes the object table entry, and performs a delivery of dispatch.drop to the vat.

This delivery will look up the target kernel object (koNN) in the exporting vat's c-list, delete that c-list entry, then call the dispatch.drop function of the target vat with the corresponding vat object ID (which should always be an export, hence of the form o+NN).

This delivery is expected to cause liveslots to delete the slotToVal (strong) Map entry for o+NN. (We may want to rename this map to something with exports in the name, to distinguish it from the weak references held by the import map). If nothing else in the vat is holding on to that exported object, it will become eligible for collection, which might cause other objects to be released, which might cause the vat to drop some imports. Hence the delivery of drop might provoke new syscall.drops to be emitted during delivery. These new drops will be removed from the c-list (if present) and pushed onto the maybe-free list for later processing, like above.

The dispatch.drop delivery is added to the exporting vat's transcript as usual. If/when this vat is rebuilt from snapshot+replay, we must re-delete the strong Map entry. The order of deletion is fixed by the kernel when it decides to inform the exporting vat of the drop (arrival-order nondeterminism).

Once the dispatch.drop delivery is complete, the kernel commits the crank as usual. The contents of the crank buffer are expected to be:

  • one removal from the maybe-free list
  • if that was the last reference:
    • one removal from the exporting vat's c-list
    • one dispatch.drop addition to the vat's transcript
      • no syscalls other than syscall.drop: userspace should not get control
    • if that caused some imports to be dropped:
      • additional removals from the c-list
      • additions to the maybe-free list

The kernel should probably drain the maybe-free list before returning to the run-queue.
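The drop/maybe-free flow above can be sketched with toy in-memory tables. The names (clists, owners, maybeFree) and shapes are invented for illustration; the real kernel would keep these in the persistent DB.

```javascript
// Toy model of syscall.drop plus maybe-free draining, as described above.
function makeKernelGC() {
  const clists = new Map();  // vatId -> Set of koNN refs in that vat's c-list
  const owners = new Map();  // koNN -> exporting vatId
  const maybeFree = [];      // koNN values awaiting a refcount check
  const dropDeliveries = []; // simulated dispatch.drop deliveries

  function clistFor(vatId) {
    if (!clists.has(vatId)) clists.set(vatId, new Set());
    return clists.get(vatId);
  }
  function addObject(ko, exporter) {
    owners.set(ko, exporter);
    clistFor(exporter).add(ko);
  }
  function importInto(vatId, ko) {
    clistFor(vatId).add(ko);
  }
  function syscallDrop(vatId, ko) {
    const clist = clistFor(vatId);
    if (!clist.has(ko)) return; // redundant drop: ignore
    clist.delete(ko);           // remove importing c-list entry
    maybeFree.push(ko);         // defer the refcount check
  }
  function drainMaybeFree() {
    while (maybeFree.length > 0) {
      const ko = maybeFree.shift();
      const exporter = owners.get(ko);
      if (exporter === undefined) continue; // already deleted
      // any reference from a vat other than the exporter keeps it alive
      let referenced = false;
      for (const [vatId, clist] of clists) {
        if (vatId !== exporter && clist.has(ko)) referenced = true;
      }
      if (!referenced) {
        owners.delete(ko);            // delete object-table entry
        clistFor(exporter).delete(ko); // delete exporting c-list entry
        dropDeliveries.push({ vatId: exporter, ko }); // dispatch.drop
      }
    }
  }
  return { addObject, importInto, syscallDrop, drainMaybeFree, dropDeliveries, owners };
}
```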

bikesheds:

  • name for the import-dropping syscall
  • name for the export-dropping delivery (it might be nice to have a name that's different from, but clearly related to, the syscall)

minor design questions:

  • layout of the new (split) import/export tables
  • persistence of the maybe-free list

larger design questions:

  • I'm thinking this should strictly be for Presences, and should ignore Promises, at least for now, because Promises can be resolved and then retired (which handles some portion of our GC needs), and we need to figure out some additional design:
    • does a .then callback keep the Promise alive? We'd need that to be true, to safely use a WeakRef on the import-side Promise to know when we can drop it
    • we'd need to examine the subscription tables and make sure they're updated correctly
    • the notion of "import" and "export" is less clear, and we need our refcounts to only measure imports, not whatever "export" means. (if a promise resolves in the woods, and nobody is around to hear a callback, did it make a reference?)
    • however, being able to immediately drop an unused result promise would be awfully nice

@warner (Member, Author) commented Oct 16, 2020

Hm, if we did apply this to Promises as well, then a crank which does an implicit sendOnly (i.e. x~.foo(); rather than x~.foo().then(stuff); or const saveForLater = x~.foo();) would potentially drop the result Promise so quickly that the finalizer would fire in the same crank, which would be ideal. The syscalls we'd observe would be:

  • syscall.send with result=p+NN
  • syscall.drop(p+NN)

The send will create a promise-table entry whose "decider" is set to null (because the send on the run-queue is the decider), and lists the sending vat as the only subscriber. The drop should ideally delete the promise-table entry and change the run-queue entry to remove the result= field, since there's only one subscriber and they just unsubscribed. When this crank is committed, the DB changes should be indistinguishable from ones made by a vat that had done an explicit sendOnly (with result=null and no drop).

If the sender actually uses the Promise (perhaps they pipeline a message to the result, or they share the result promise with some other vat), then this is less clear. Pipelined messages whose ultimate result promise is dropped should also drop the intermediate results. Shared promises should not be dropped until the last subscriber has dropped.

The "imports" of a promise-table entry are the subscribing vats, plus the run-queue messages that target it. The "exporter" of a promise-table entry is the deciding vat, if any, or the run-queue entry which holds that promise as a result.
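A sketch of the result-promise retirement described above, with an invented table shape: when the sole subscriber drops an undecided promise, the entry is deleted and the queued send loses its result= field, leaving the DB state indistinguishable from an explicit sendOnly.

```javascript
// Illustrative model only: field names (decider, subscribers, result) mirror
// the discussion above but are not the real kernel's schema.
function makePromiseTable() {
  const promises = new Map(); // pref -> { decider, subscribers: Set }
  const runQueue = [];        // { target, method, result }

  function send(sender, target, method, result) {
    // decider is null because the queued send itself is the decider
    promises.set(result, { decider: null, subscribers: new Set([sender]) });
    runQueue.push({ target, method, result });
  }
  function drop(vatId, pref) {
    const entry = promises.get(pref);
    if (!entry) return;
    entry.subscribers.delete(vatId);
    if (entry.subscribers.size === 0 && entry.decider === null) {
      // nobody can ever hear the resolution: retire the promise and
      // turn the queued send into a sendOnly
      promises.delete(pref);
      for (const msg of runQueue) {
        if (msg.result === pref) msg.result = null;
      }
    }
  }
  return { send, drop, promises, runQueue };
}
```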

@FUDCo (Contributor) commented Oct 16, 2020

Although we need more bookkeeping if we want to allocate a new one (the valToSlot WeakMap might still remember the old ID, but in this case we're not supposed to use it, so we'd need to change that map to point to a record that is also pointed to by a strong map, and contains a flag that says "don't use me, allocate a new ID instead"). OTOH if we allow the ID to be reused, that might complicate some kernel-side data structure that keeps track of which IDs are alive or not.. haven't thought about that structure yet. My hunch is that reusing IDs will be simpler overall.

My intuition is the opposite: that generating new IDs will be much simpler, because it means we can retire old ones when we think they're dead instead of having to keep track.

@warner (Member, Author) commented Oct 18, 2020

Task list (obsolete, use checklist in first comment instead):

  • fix: add WeakRef tracking to liveslots #1927 change liveslots import-side tables to retain weakrefs to Presences. Finalizers log but nothing else. Probably need to split slotToVal/valToSlot into separate import- and export-side maps, to accommodate different strong/weak needs. Tests: run with node --expose-gc, have the test program use gc() to provoke a drop, manually inspect the log.
  • fix: add WeakRef tracking to liveslots #1927 add syscall.drop. kernel-side handler ignores it. drop is excluded from replay fidelity checks (or excluded from the transcript entirely, must decide). Testing: use gc() from test program, assert syscall.drop is in transcript (or other test-visible record)
  • syscall.drop deletes importing vat c-list entry. Testing: non-liveslots vat to call syscall.drop directly (no gc()), examine clist after gc().
  • implement maybe-free DB list, syscall.drop adds to list, list is serviced before run-queue, but "service" is a no-op. Testing: kernel-side, no liveslots
  • implement refcount check, service-maybe-free determines if object is now only referenced by exporting vat, but action is to merely log. Testing: kernel-side, no liveslots
  • implement dispatch.drop, but liveslots ignores
  • implement liveslots dispatch.drop by deleting exports-map entry. Simultaneously change kernel implementation to delete exporting-vat clist entry. Testing: not sure, maybe manual.

@warner (Member, Author) commented Oct 20, 2020

Hm, I just thought of a wrinkle: the FinalizationRegistry might run the notification callback at any time, which means it might be called in between cranks, or when the vat is not supposed to be active. That means we can't immediately send a syscall.drop when it fires: instead we must wait until the vat is supposed to have agency, which means when the next inbound delivery is made. (Liveslots doesn't learn when the crank is finished, merely when the crank is begun, and when invocations are made which generate syscalls).

Worst case, a vat which is told to shut down (and politely deletes all the internal references it can, to accelerate GC) might not actually see the finalizers run within that same crank. And, if nobody ever talks to that vat again, it won't ever get agency again, which means we won't have a safe place to emit the drop syscalls, leaving that vat's references fixed in place, unable to be collected. Ugh.

One slight mitigation would be to have the notifier set a flag and check that flag each time liveslots gets control (so each time an outbound message is sent). That might increase our chances of getting the drop out, but it's no guarantee.

Another approach would be to add an explicit delivery type whose only job is to check the finalizer flags and report any dead references. We might call this immediately after every crank finishes. I don't like the overhead this would represent, but it's hard for me to resist the idea of adding a perfectly-named dispatch.bringOutYourDead method. It might be sufficient to call it once every N cranks or something.
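What the hypothetical dispatch.bringOutYourDead could look like in liveslots (syscall here is a stand-in, and the name is the joke above, not a committed API): finalizer callbacks only record vrefs, and the dedicated delivery is the one place drop syscalls are emitted.

```javascript
// Sketch: finalizers set flags (add to a set) at arbitrary times; the
// explicit delivery drains the set while the vat legitimately has agency.
function makeLiveSlotsGC(syscall) {
  const deadSet = new Set(); // vrefs whose Presences were finalized

  function onFinalize(vref) {
    // safe at any time: no syscalls, just bookkeeping
    deadSet.add(vref);
  }
  function bringOutYourDead() {
    // runs as a delivery, so emitting syscalls is legal here
    for (const vref of deadSet) {
      syscall.drop(vref);
    }
    deadSet.clear();
  }
  return { onFinalize, bringOutYourDead, deadSet };
}
```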

It might also be possible to blur the line between vat and kernel a bit more, and have these notifications go directly to the kernel, rather than being handled entirely by liveslots. In particular, if the kernel didn't learn about them through a syscall (but rather through some other channel that travels the same kernel-worker protocol as syscalls), then it might be ok for them to occur outside the usual "only during a delivery" window. This would also remove them from the transcript entirely, which means one fewer exception to implement.

@erights (Member) commented Oct 20, 2020

This would also remove them from the transcript entirely, which means one fewer exception to implement.

Would the syscall interface go back to carrying only deterministically logged things? If not, what are the other exceptions?

@erights (Member) commented Oct 20, 2020

Worst case, a vat which is told to shut down

Is shutting down distinct from terminating?

If a vat is terminating, why do we care what drops it reports before terminating. Terminating will deterministically drop everything anyway, yes?

@warner (Member, Author) commented Oct 20, 2020

Would the syscall interface go back to carrying only deterministically logged things? If not, what are the other exceptions?

Yes, I think if we moved syscall.drop to some other pathway, then all syscalls would be deterministic again (so the transcript could contain, and check, all syscalls, removing the exceptional case where we must strip out drop syscalls before performing the comparison during replay).

Worst case, a vat which is told to shut down

Is shutting down distinct from terminating?

Yeah, I didn't mean termination, I meant something smaller, in which the vat as a whole keeps running, but the code inside it knows that its job is complete, so it does as much cleanup as it can (empty all the Stores to facilitate GC) while it still has control. More like emptying out the fridge before a holiday of indeterminate length, rather than burning down the entire house.

It might not be that common/useful of a case. Maybe an auction contract that we keep around for a long time but which only gets used occasionally, and which builds up a lot of references to external objects (which we'd really like to drop) when it does get run. Imagine the contract handles a dozen messages, and on the last one, it knows that this particular invocation is complete, so it drops all those references. If the finalizers don't run until after the vat loses agency, it won't have a chance to tell the kernel that it doesn't need them anymore, and we won't be able to GC upstream until the next auction happens.

If a vat is terminating, why do we care what drops it reports before terminating. Terminating will deterministically drop everything anyway, yes?

Agreed.

@erights (Member) commented Oct 20, 2020

This reminds me of the other spill-to-disk persistence scaling problem I've been meaning to raise, that it sounds like you've already been thinking about:

Just like we assume that there are a moderate number of active purses at any time, but a zillion idle purses that should not be occupying any ram, we should make the same assumption about vats. That there are a moderate number of active vats at any time, but a zillion idle vats that should not be occupying any ram.

Likewise, most intervat capabilities will be from an idle vat and to an idle vat, and therefore also themselves idle. These idle caps between idle endpoints should not take ram bookkeeping space in the kernel, in comms vats, or in captp-on-ibc. How much of this is already implied by our current plans?

@warner (Member, Author) commented Oct 20, 2020

Yep, the heap-snapshot and worker-process design is meant to support vats being demand-paged into RAM. We'll have the option of loading any given vat as soon as the process starts, as late as when someone sends a message to them, or some point in between. The actual policy/heuristic is an open question: the concern is that the vat might take a significant time to load (replay), and if we wait until the last moment, block processing might stall while we catch up. We might be able to peek at the escalators to make a guess about what deliveries are coming up, and use any leftover time to load the vats while the messages are still climbing. OTOH, if there are any messages at all remaining on the escalators at the end of the block, that indicates we're sufficiently overloaded that we can't keep up (we're now rationing/allocating execution time), so we might not have any spare CPU cycles to load vats. OT3H there will be multiple CPU cores, and we might not be able to use all of them for normal message processing: maybe we wind up with one (serialized) core taking things off the escalator, leaving the other cores available to page in vats which have messages on their way.

The inter-vat caps will consume space in:

  • client Presence (import): RAM, unless it's stored as data in a virtual object which is not active, or the vat as a whole is paged out, in which case disk
  • client vat c-list: disk
  • kernel object table: disk
  • target vat c-list: disk
  • target vat object (export): RAM, or disk for an inactive virtual object or a paged-out vat
  • comms vat c-list: disk
  • comms vat routing tables: currently RAM, but the plan is to move all of that to disk

So I think idle caps shouldn't take up any RAM when we finish implementing virtual objects, paging out idle vats, and comms tables on disk.

@warner (Member, Author) commented Oct 20, 2020

I had an idea last night to address the timing / drop-notification channel issue.

  • the vat worker (the "supervisor" that lives in the same process as liveslots and the vat code, and which handles the quiescence tracking) creates a Set, which will hold vat-side object IDs (vrefs) that are ready to be dropped. It shares this Set with liveslots.
  • the FinalizationRegistry callback adds a vref to this set, which can happen at any moment (to be precise, between any two turns), even when the vat is supposed to be idle, because other vats, or the kernel, might do something which triggers an engine-wide GC pass, which finally notices some of the idle vat's objects.
    • we maintain the "vat does nothing when it's idle" property by ensuring that the only thing the callback does is add the vref to the set. This is weaker than before (vat code, specifically liveslots, still gets control when the vat is supposed to lack agency), but still maintains the "no syscalls from supposedly-idle vats" rule which is what we really care about.
  • each time a delivery arrives at liveslots, just after we've unserialized all the references (so now we have strong refs to the Presences/etc cited in the message), we remove all the cited vrefs from the set
  • when the vat becomes quiescent, the vat worker/supervisor code is responsible for sending a "delivery result" back to the kernel (via whatever channel it has to the kernel-process-side vat manager). This code will include a copy of the Set (as an Array of vref strings) in the VatDeliveryResult message (so ['ok', droppedVrefs] instead of just ['ok']). The Set will be cleared at the same time.
  • the VatManager, upon receipt of the VatDeliveryResult, will delete these entries from the c-list and push the corresponding kernel-side reference strings (krefs) onto the maybe-free list, for later processing.
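The handshake above, as a toy supervisor (the dispatch argument and the VatDeliveryResult shape are simplified stand-ins for illustration):

```javascript
// Sketch of the supervisor/liveslots drop-set handshake: finalizers only add
// to the shared set, deserialization un-drops cited vrefs, and the set is
// reported and cleared as part of each delivery result.
function makeSupervisor(dispatch) {
  const droppedVrefs = new Set(); // shared with liveslots; finalizers add here

  function noteFinalized(vref) {
    droppedVrefs.add(vref); // may run between any two turns, even while "idle"
  }
  function deliver(delivery) {
    // deserialization re-imports any cited vrefs, so un-drop them first
    for (const vref of delivery.slots) {
      droppedVrefs.delete(vref);
    }
    dispatch(delivery);
    // report and clear the set in the VatDeliveryResult
    const dropped = [...droppedVrefs];
    droppedVrefs.clear();
    return ['ok', dropped];
  }
  return { noteFinalized, deliver };
}
```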

Benefits are:

  • the kernel can continue to believe that vats are idle unless a delivery is active: the kernel will not receive syscalls at random times
  • drop is not delivered through a syscall, so we don't have to change the transcript replay checks to exclude it
  • I think this should correctly handle the race/TOCTTOU hazards. The finalization callback, by careful design of our TC39 heroes, runs in its own turn, so the Set won't be modified during beginning-of-crank delivery processing, nor during end-of-crank processing, both of which happen in a single turn (no await breaks)
    • the vat might lose some objects while it's supposed to be idle, and these need to be reported eventually (the image in my head is: you wake up in the morning, and your slippers are missing, because the GC cat took them away while you were sleeping)
    • the new delivery might cite a vref which the vat lost between deliveries (the Slipperman, like a milkman but delivers slippers, brought you a new pair first thing in the morning)
    • the deserialization of the vref will re-populate the liveslots table: same o-NN key, new Presence object (the full-service slipperman puts the new slippers right back in the place where the old ones were)
      • vat code dropped the old Presence so it can't tell that it's been given a new object (you weren't wearing your slippers when the cat dragged them away, so you won't realize you've got new ones)
      • we must remove the vrefs from the Set at that point (you're no longer missing your slippers)
    • during the crank, the vat might drop the refs again, which means we should add them back to the Set (the cat might steal your slippers during the day)
    • at the end of the crank, we report the Set to the kernel (the, um, Department Of Slipper Auditing, to whom you must report the total number of slippers in your household for tax purposes, wants to know whether they're present or not)
    • there's no way to acquire a new copy of the vref until the beginning of the next crank, so it is correct to drop everything in the Set at this point. There might be more drops between end-of-crank and the next start-of-crank, but there won't be any additions.
  • if finalization callbacks happen promptly, this will free up more objects than a scheme in which syscall.drop is made at the next start-of-crank: we have a larger window to observe the callbacks, allowing us to report the drops earlier.

The drawback is that it still suffers from the late-drop issue above. A vat which stops using a bunch of objects, but for which the finalization callbacks don't occur until after the end-of-crank is processed, will hold on to those references until some future delivery to that same vat. If the vat is now idle for a long time, that's a lost GC opportunity.

Worse, from my experiments it appears that Node.js (at least) decided to make GC notifications ride the IO/timer queue, not the promise queue. So nothing will be added to the Set at all until after our waitUntilQuiescent() call completes. It might be sufficient to add a second waitUntilQuiescent to the supervisor's delivery code, to allow these notifications to run before we sample and clear the Set.

I don't know how GC happens in our JS engines. I'm used to Python, where refcounting causes non-cyclic GC to happen immediately (too fast, in fact, the weakref callback can happen in the middle of some other bytecode operation, so the only safe thing you can do in the callback is to write to a pipe-to-self and notify a safer handler). The mark-and-sweep is only needed to free up cycles, and you can minimize the need for that by manually breaking cycles when your operation is done. In JS, I can imagine smooth performance could be improved by deferring all GC until some later time, even if a refcount drops to zero and you know you could free the memory immediately. If that's the engine's choice, we might have a tradeoff between how long we wait at end-of-crank and how much GC happens in a timely fashion.

@erights (Member) commented Oct 21, 2020

I don't know how GC happens in our JS engines. I'm used to Python, where refcounting causes non-cyclic GC to happen immediately (too fast, in fact, the weakref callback can happen in the middle of some other bytecode operation, so the only safe thing you can do in the callback is to write to a pipe-to-self and notify a safer handler).

No refcounts in JS so GC never happens immediately by simply dropping a reference. GC might happen in a turn, but our WeakRef design delays all notifications to happen in later turns. As you suspected earlier:

I think this should correctly handle the race/TOCTTOU hazards. The finalization callback, by careful design of our TC39 heroes, runs in its own turn, so the Set won't be modified during beginning-of-crank delivery processing, nor during end-of-crank processing, both of which happen in a single turn

The rest of this seems right. Good stuff!

@erights (Member) commented Oct 21, 2020

Node.js (at least) decided to make GC notifications ride the IO/timer queue

This does not surprise me. There was a move to have the spec require this, but I think that died. Not sure --- we should check the spec. If it did die, XS may gc more promptly. But how significant is the difference to us? Doesn't it just mean that things might stick around for one more crank than they should? Do we care?

@warner (Member, Author) commented Oct 21, 2020

No refcounts in JS so GC never happens immediately by simply dropping a reference.

Oh, that's disappointing. My initial tests are showing that if I don't provoke Node with a gc() call (under node --expose-gc), then a simply-dropped non-cyclic object does not get finalized "soon", meaning I dropped the reference and waited a few hundred cycles of the timer/IO queue and never saw the finalizer get called.

GC might happen in a turn, but our WeakRef design delays all notifications to happen in later turns.

Not just "later", but it could be much much later. If the engine defers GC until it is feeling memory pressure (hours? days?), and our design depends upon the importing vat getting at least one delivery after the finalizers have run, then big+intermittent vats will be a problem: they might know the imports can be dropped, but they can't safely express it until they get some runtime, and that won't happen until someone sends them another message, which could be days or weeks later, if ever.

This does not surprise me. There was a move to have the spec require this, but I think that died. Not sure --- we should check the spec. If it did die, XS may gc more promptly. But how significant is the difference to us? Doesn't it just mean that things might stick around for one more crank than they should? Do we care?

If the notification happened promptly (e.g. because of a refcount going to zero):

  • if the notification happened on the promise queue, the finalizers would run while the vat is still processing the delivery, so the Set of dropped refs would be known by the time waitUntilQuiescent finishes (i.e. the promise it returned has resolved)
  • if it happens on the timer/IO queue, the finalizers would run just after waitUntilQuiescent finishes, so the Set wouldn't be updated until after the vat had finished with the delivery, so the kernel wouldn't learn about them until the next delivery (which could be days away, for a little-used vat)
    • we could compensate for that by calling waitUntilQuiescent a second time, immediately after the first, to give the finalizers a chance to run before the delivery is complete

But since notifications don't appear to happen promptly, the difference is moot. They'll happen some arbitrarily long time in the future (depending upon memory pressure), and then on the delivery after that point, the vat will have a chance to drop the imports and allow the kernel to do its higher-level refcounting.

I'm feeling drawn back to dispatch.bringOutYourDead, but I have no idea how frequently it ought to be called.
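To illustrate the "second waitUntilQuiescent" compensation mentioned earlier, here's a sketch assuming a setImmediate-based waitUntilQuiescent (an assumption, not the real implementation): the extra wait gives callbacks riding the timer/IO queue one more chance to run before the delivery result is sampled.

```javascript
// Resolves after the promise queue has drained and one check-phase pass ran.
function waitUntilQuiescent() {
  return new Promise(resolve => setImmediate(resolve));
}

// Run the delivery, wait twice, then sample whatever the (simulated)
// finalizers managed to record. sampleDrops stands in for reading the
// dropped-vrefs Set.
async function deliverWithGCWindow(dispatch, sampleDrops) {
  dispatch();
  await waitUntilQuiescent(); // promise queue has drained
  await waitUntilQuiescent(); // IO-queue callbacks got one more turn
  return sampleDrops();
}
```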

@warner (Member, Author) commented Oct 22, 2020

Notes from today's kernel meeting:

The VatTP refcounting protocol is designed to correctly manage object-table drops between two asynchronously-coupled systems. Each side maintains a table of counters, indexed by object-id. When A sends a message to B, it increments the counter for every object-id included in the message. When B receives the message, it maps the object-ids to local representatives or Presences by looking at its table. If the table has an entry for the object-id, B increments the counter and uses the old Presence. If there is no such entry, B creates a new Presence and sets the counter to 1. When B learns that the Presence is no longer referenced locally (e.g. through a FinalizationRegistry or equivalent), it sends a DECREF message to A along with the last counter value that was stored. When A receives DECREF(object-id, N), A decrements its counter by N. If the result is zero, A removes the entry from its table (and propagates the drop further upstream, typically by dropping its strong reference on the exported object). If the result is greater than zero, that means a new reference (from A to B) passed the DECREF (from B to A), and we expect B to reconstruct a new Presence.

In swingset, the link between kernel and vat is synchronous. However the link between Presences becoming unreferenced and the finalizers running is asynchronous (because the finalizer runs in some later turn, perhaps much much later, since JS likes to do extremely lazy GC). We can use this same protocol.
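A runnable toy of that protocol (all names invented): A counts every mention it sends, B counts every mention it receives, and B's DECREF carries B's whole count, so a mention that crosses the DECREF in flight can never cause a premature drop.

```javascript
// Two-sided mention-counting, as described above. For simplicity the
// "in flight" gap is modeled by the caller choosing when receiveRef runs.
function makeDecrefPair() {
  const aCounts = new Map(); // A: objectId -> unacknowledged mention count
  const bCounts = new Map(); // B: objectId -> local Presence reference count
  const aDropped = [];       // objects A has fully dropped

  const A = {
    sendRef(id) {
      aCounts.set(id, (aCounts.get(id) || 0) + 1);
    },
    onDecref(id, n) {
      const remaining = aCounts.get(id) - n;
      if (remaining === 0) {
        aCounts.delete(id);
        aDropped.push(id); // propagate the drop upstream
      } else {
        aCounts.set(id, remaining); // a mention crossed the DECREF in flight
      }
    },
  };
  const B = {
    receiveRef(id) {
      bCounts.set(id, (bCounts.get(id) || 0) + 1);
    },
    dropPresence(id) {
      const n = bCounts.get(id); // last stored counter value
      bCounts.delete(id);
      A.onDecref(id, n); // DECREF(objectId, N)
    },
  };
  return { A, B, aCounts, aDropped };
}
```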

The liveslots import table will map object-id to a record containing a Presence and a mutable counter. The finalizer will reference the mutable counter:

if (!slotToVal.has(vref)) {
  // first mention of this vref: create a Presence with a mutable counter
  const counter = { count: 1 };
  const p = createPresence(vref);
  slotToVal.set(vref, { p, counter });
  // the finalizer reads counter.count when the Presence is collected
  registry.register(p, counter);
  return p;
} else {
  // repeat mention: reuse the existing Presence and bump the counter
  const { p, counter } = slotToVal.get(vref);
  counter.count += 1;
  return p;
}

The finalizer will send a DECREF(counter.count) to the kernel. The kernel's c-lists will contain a count of mentions, incremented for each appearance of that object-id in each delivery.

@erights had a trick for managing the counter correctly despite a large number of mentions (potentially overflowing any reasonable-sized counter value). He used a deliberately small 16-bit counter. When the receiving side got close (perhaps 2^15, there is a race), the receiver would send a DECREF for the old value minus one, and reduce their counter to 1 (not zero). The sender would see the DECREF and reduce their counter with no risk of prematurely dropping the object. This works as long as the rate of wire-crossing messages isn't high enough to risk sending more than the threshold value.
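A sketch of that trick, assuming a 16-bit counter and a 2^15 "getting close" threshold (both illustrative): when the receiver's counter nears overflow, it sends a DECREF for count - 1 and resets to 1, so the sender's count can never prematurely reach zero.

```javascript
const THRESHOLD = 2 ** 15; // "getting close" point for a 16-bit counter

// Receiver side: counts incoming mentions and proactively acknowledges all
// but one of them before the counter can overflow.
function makeBoundedReceiver(sendDecref) {
  let count = 0;
  function receiveMention(id) {
    count += 1;
    if (count >= THRESHOLD) {
      sendDecref(id, count - 1); // acknowledge all but one mention early
      count = 1;                 // never drops to zero here
    }
  }
  return { receiveMention, get count() { return count; } };
}
```

This works as long as fewer than THRESHOLD new mentions can be in flight at once, matching the rate caveat above.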

I'm still trying to find a sensible channel for vats to send these DECREFs to the kernel, but this protocol probably makes it safer to send the messages outside the context of a given message delivery. The kernel would react to DECREF by adding the (object-id, decrement-count) pair on the maybe-free list, and then at a safe time (between cranks) it could examine this list and perform the subtraction.

The trouble with DECREFs that can be sent at any moment is how a nominally-blocking vat-worker subprocess should deliver them. I was hoping these subprocesses could operate with a single pipe to the kernel process, spending all of their idle time in a blocking read, either waiting for a new delivery or for the results of a syscall. But maybe that's compatible with sending DECREFs at other times after all: the worker subprocess could do a blocking read of the kernel->worker pipe, yet be allowed to write to the worker->kernel pipe any time it likes. The kernel will be doing a non-blocking read of the worker->kernel pipe anyway (since it supports multiple worker processes), so it could receive DECREFs at arbitrary times. We would need to add some synchronization on shutdown to make sure we don't drop DECREFs as the process is being torn down, though.

warner added a commit that referenced this issue Oct 26, 2020
These two authorities are not part of SES, so they must be pulled from the
globals of the Start Compartment and ferried through the kernel to the vat
manager factory that calls makeLiveSlots.

This gives the outer layer of the vat (liveslots) access to nondeterminism.
We rely upon liveslots to not share this power with the user-level ocap-style
"vat code". Liveslots must never allow user-level code to observe behavior
that depends upon GC activity, because that activity is not part of the
specified input to the vat.

refs #1872
warner added a commit that referenced this issue Oct 26, 2020
Node.js v14 provides `WeakRef` and `FinalizationRegistry` as globals. Node.js
v12 does not (there might be a command-line flag to enable it, but I think
it's marked as experimental).

Rather than require all users upgrade to v14, we elect to disable GC when
running on v12. This change attempts to pull `WeakRef` and
`FinalizationRegistry` from the global, and deliver either the real
constructors or `undefined` to the liveslots code that uses it. We'll write
that liveslots code to tolerate their lack.

refs #1872
refs #1925
warner added a commit that referenced this issue Oct 26, 2020
The upcoming GC functionality will require `WeakRef` and
`FinalizationRegistry`. Node.js v14 provides these as globals, but v12 does
not (there might be a command-line flag to enable it, but I think it's marked
as experimental). Rather than require all users upgrade to v14, we elect to
disable GC when running on v12.

This adds a local `weakref.js` module which attempts to pull `WeakRef` and
`FinalizationRegistry` from the global, and exports either the real
constructors or no-op stubs.

refs #1872
refs #1925
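The fallback module might look roughly like this. This is a reconstruction of the shape described in the commit, not the actual weakref.js source: on engines lacking the real constructors, the stubs hold strong references and never fire finalizers, so GC is simply disabled rather than broken.

```javascript
// Stub WeakRef: a plain strong reference, so the target is never collected.
class FakeWeakRef {
  constructor(target) {
    this.target = target;
  }
  deref() {
    return this.target;
  }
}

// Stub FinalizationRegistry: accepts registrations but never runs the
// cleanup callback, so no GC activity is ever observed.
class FakeFinalizationRegistry {
  constructor(_cleanupCallback) {}
  register(_target, _heldValue, _unregisterToken) {}
  unregister(_unregisterToken) {}
}

// Export the real constructors when the platform provides them.
const WeakRefClass =
  globalThis.WeakRef !== undefined ? globalThis.WeakRef : FakeWeakRef;
const FinalizationRegistryClass =
  globalThis.FinalizationRegistry !== undefined
    ? globalThis.FinalizationRegistry
    : FakeFinalizationRegistry;
```

Code built on top of this module works identically on both platforms; on the stubbed one it merely retains every import forever.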
warner added a commit that referenced this issue Oct 26, 2020
In the future, when liveslots implements GC and discovers that a Presence has
ceased to be referenced by the user-level vat code, it will call this
function to decrement the kernel's reference count for the imported object's
c-list entry.

`vatDecRef` might be called at any time (although always in its own turn).
The kernel will eventually add the decref information to a queue, to be
processed between cranks. For now, the kernel only records the information if
an option was set to enable it (for future unit tests).

Most of this patch is the kernel-worker protocol wiring to allow
child-process vat workers to deliver the decref back up to the kernel
process.

Liveslots does not use this yet. A future patch will switch it on.

refs #1872
@warner
Copy link
Member Author

warner commented Dec 5, 2020

GC of imported Promises

As mentioned, I'm still thinking through how this relates to Promises, rather than objects (Remotable on one side, Presence on the other). The question is what happens when a Promise gets imported/exported through the vat/kernel boundary, and then gets forgotten by one side or the other.

In general, we hope Promises to be easier to manage than objects, because 1: most Promises eventually resolve, and 2: we aren't obligated to maintain identity for resolved Promises. As a result, when a resolution is sent across the vat/kernel boundary, we also "retire the vref" (i.e. the vpid). This removes it (formally) from the vat's c-list, as well as the internal liveslots table. "The matter is finished, the question is settled, let us not speak of this again", which is enforced by the fact that both sides delete all record of the vpid so they can't bring it back up. The kernel-side promise table entry (the kref, i.e. a kpid) remains present until all other formal references to the kpid are removed, at which point it can be formally GCed and the table entry deleted. If the kpid is sent back into the vat before it can be deleted, the kernel will create a new vpid to describe the (now-resolved) Promise, add a new c-list entry (with the new vpid and the old kpid), and will send a resolution into the vat immediately following the message that referenced it.
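A minimal sketch of that re-export path (all names invented; the real kernel tables are database-backed, not Maps): a resolved kpid with no c-list entry gets a fresh vpid, and the kernel queues an immediate notify so the vat learns the resolution right away.

```javascript
// Toy kernel side: a promise table, one vat's c-list, and a delivery queue.
function makeKernelSide() {
  const promiseTable = new Map(); // kpid -> { state, resolution }
  const clist = new Map(); // kpid -> vpid
  let nextVpid = 0;
  const deliveries = [];

  function translateKpidForVat(kpid) {
    let vpid = clist.get(kpid);
    if (vpid === undefined) {
      vpid = `p-${nextVpid += 1}`; // vat has never seen this promise
      clist.set(kpid, vpid);
      const p = promiseTable.get(kpid);
      if (p.state === 'resolved') {
        // follow the message with the resolution, so the new vpid can be
        // retired again almost immediately
        deliveries.push({ type: 'notify', vpid, resolution: p.resolution });
      }
    }
    return vpid;
  }
  return { promiseTable, clist, translateKpidForVat, deliveries };
}
```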

This gives us more opportunities for formal GC than for objects, because we're allowed to delete the vref even though userspace might still hold a reference to the JS Promise object. But we still need to understand what happens when vat code stops referencing a Promise, and what the kernel observes. We're using the same slotToVal structure for everything, so we need to know what that WeakRef will see.

JavaScript Promises And Their References

The ECMAScript specification on the Promise constructor (and on Promise instances) is intentionally silent on the question of which objects retain strong references to which others, because overspecification would limit engine implementers' ability to add performance optimizations. In addition, Promise can be subclassed, so overridden methods might want access to things that the default implementation does not, introducing more references.

There are basically four things of interest: the Promise itself, the resolution functions (resolve and reject, as handed to the "executor" callback passed as new Promise(executor)), any callbacks provided to Promise methods like .then/.catch/.finally, and the value to which the Promise is ultimately settled.

There are two references we can assume exist based upon their importance to keep Promises working as expected:

  • If you have access to a settled Promise, then you can use .then to obtain access to the settlement value. So the Promise must keep a strong reference to this settlement.
  • If someone has used .then to attach callbacks to an unresolved Promise, then anyone with access to resolve/reject can cause those callbacks to be run, regardless of whether the Promise is still reachable. So either resolve/reject must keep a strong reference to the list of unexecuted callbacks, or they must reference something else which does the same.

I ran some WeakRef experiments with Node.js (version 14.8.0) to deduce where the internal references seem to lie:

[diagram: promise-gc — observed internal references among the Promise, resolve/reject, .then callbacks, and the settlement value]

In this diagram, the dashed lines (A -> B) mean there appears to be a strong reference from A to B (i.e. B will not be GCed unless A is GCed first), but the specification doesn't appear to require one. Note that this does not mean someone holding A can get access to B: the strong reference is internal. The solid line indicates both a strong reference and that the holder of A can get to B.

It appears that the resolve/reject functions hold a strong reference to the Promise object itself, whether in a resolved or unresolved state. And an unresolved Promise seems to hold a strong reference to any .then/etc callbacks which have been added (these refs go away when the Promise is settled and the callbacks are invoked). A settled promise holds a strong reference to the settlement object, as expected.

As we build something around this, we must be prepared for the dashed edges to be missing, because engines are not required to maintain them. For example, there's no obvious reason that the resolve function needs to keep the Promise alive: it needs to keep the queue of callbacks alive, but not necessarily the Promise itself.

There is no obvious way to sense when .then has been called, which introduces a requirement described below. This is unfortunate because the sendOnly pattern is very common: if we could sense that a Promise was dropped without .then ever having been called on it, we could GC the Promise object sooner. (Note that one of @michaelfig 's source-source transforms for the tildot operator would parse the code to spot this case, and use a different HandledPromise invocation, which would solve the problem at a different layer).

Terminology

For objects, which have a single clear exporter (and zero or more importers), we have three nouns: Remotable, Kernel Object, and Presence. The first and last are used in the exporting and importing vat, respectively. They are both vat-code -visible JS Object instances that coordinate with the more-abstract kernel-side entity, through syscalls and deliveries. Within the kernel, the "kernel object" is not represented by an actual JS Object (remembering the kernel might not even be written in JavaScript): instead, it is represented by a row in a database table, and referenced by short strings (kref values, like ko24) rather than references in a JS object graph.

For Promises, the situation is ever so slightly murkier. Promises have a single creator, but its identity isn't very interesting. Instead, message routing depends upon the current decider, which can change over time as resolution authority is passed from a vat, into the kernel, and off to some other vat, before being consumed in an act of resolution. The abstract "promise" is represented in all vats by an actual JavaScript Promise object, plus a variety of liveslots tables which coordinate its behavior with the kernel. User-level vat code can interact with this Promise as if it were the same Promise that every other vat was using (modulo things like resolved promises losing their identity). As with objects, within the kernel there is a database table entry to manage the promise state, rather than a real JavaScript Promise object.

A vat might create a native Promise object and then export it into the kernel (as the argument of some message delivery, or inside the resolution of some other promise). In this case, liveslots must arrange to notify the kernel when the local vat resolves that Promise. Vats also create a result Promise in the process of sending a message through the kernel: the liveslots code that implements HandledPromise.applySend creates and returns a Promise instance. In this case, liveslots must prepare for the kernel to resolve the promise, and cause the Promise object's resolve/reject callback to be invoked at the right time. Vats might also import a promise from the kernel (in arguments), which prompts liveslots to create and deliver a Promise just like it would for the result of an outbound message send.

Vats Which Import a Promise

When a vat imports a Promise, convertSlotToVal notices the new vpid and uses makeImportedPromise(slot) to build a new HandledPromise for delivery to the user-level vat code. The resolve/reject functions are stored in a "pRec", which is itself stored in the importedPromisesByPromiseID table (a strong Map). The Promise itself goes into slotToVal (a Map) and valToSlot (a WeakMap), just like a Presence would be during object import, so that future inbound references to the promise (via slotToVal) are translated into the same Promise object, and so future references this vat makes in outbound messages are translated (via valToSlot) into the same vpid vref.

Later, if/when the kernel notifies the vat of the promise being resolved, dispatch.notifyFulfill* looks up the pRec and invokes the resolve or reject function.

When we change slotToVal to map to WeakRefs instead of strong references, Promises will be treated the same way (so convertSlotToVal doesn't need to special-case anything). We must ensure that Promises are kept alive, when appropriate, using something comparable to the exports set proposed above for objects.

I'm still working things through, but maybe we never put vpids in the Dead Set, and never do local-only GC of promises, because I don't think JavaScript gives us enough information to know if/when that would be safe (but I could be wrong). And I'm hoping that basically all promises get resolved, so formal GC can happen, and we won't be left with enough garbage to be a concern.

The overall structure will look something like this:

[diagram: GC of imported promises]

We'll have a finalizer on the Promise, but its main job is to delete the slotToVal entry (freeing the WeakRef). I don't yet know if it should put the vpid in the Dead Set. I do know that any vref that gets formally deleted should be excluded from the Dead Set, so when we retire the vpid (because the promise was resolved), we should no longer pay attention to the finalizer firing.

User code might receive an imported Promise, use .then to attach a callback, and then forget the Promise. If nothing else is holding on to it (remembering that slotToVal is now weak), the Promise would be collected. This by itself would be fine, because nobody still needs the actual Promise, but we do need to ensure that the callback will be invoked eventually. We cannot sense that a callback has been added, so we must arrange for resolve or reject to be called unconditionally.

To this end, we need the importedPromisesByPromiseID table to be a strong Map from vpid to the pRec, which will in turn provide a strong reference to the resolve and reject functions. As we saw above, this happens (at least in Node.js) to keep the Promise alive, but more importantly it almost certainly keeps the .then callback list alive, which is what we really care about. If importedPromisesByPromiseID were mapping to a WeakRef (like slotToVal does), perhaps directly to the resolve callback function, then we could be left in a situation where the kernel notifies the vat about a resolution, but liveslots can no longer reach the resolve() it needs to act upon it.

When the promise is resolved, we can delete the importedPromiseByPromiseID mapping. At that point, the Promise will only be kept alive by vat code keeping its own strong reference.
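Putting the import-side pieces together, a simplified sketch (not the real liveslots code; importPromise and notifyFulfill stand in for the real convertSlotToVal/dispatch paths) might look like:

```javascript
const slotToVal = new Map(); // vpid -> WeakRef(Promise)
const valToSlot = new WeakMap(); // Promise -> vpid
const importedPromisesByPromiseID = new Map(); // vpid -> { resolve, reject }

function importPromise(vpid) {
  const wr = slotToVal.get(vpid);
  const existing = wr && wr.deref();
  if (existing !== undefined) {
    return existing; // repeat imports yield the same Promise identity
  }
  let pRec;
  const p = new Promise((resolve, reject) => {
    pRec = { resolve, reject };
  });
  // strong: keeps the resolvers (and thus the .then callback list) alive
  importedPromisesByPromiseID.set(vpid, pRec);
  slotToVal.set(vpid, new WeakRef(p));
  valToSlot.set(p, vpid);
  return p;
}

function notifyFulfill(vpid, value) {
  const pRec = importedPromisesByPromiseID.get(vpid);
  pRec.resolve(value);
  // resolution retires the strong entry: from here on, the Promise
  // survives only if vat code still holds its own reference
  importedPromisesByPromiseID.delete(vpid);
}
```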

(TODO: investigate what happens during the interval between resolve() being called and the turn where the .then callback is invoked: what does the microtask queue keep alive?)

I'm still thinking through the exporting side, I'll write that up next.

@warner
Copy link
Member Author

warner commented Dec 8, 2020

GC of exported promises

[diagram: GC of exported promises]

On the export side, I think we need to add the Promise to the strong exports set, as we've planned to do for Remotables, but we won't expect to be able to drop the Promise before it is resolved (at which point we can formally retire it and drop it properly).

When user code creates a Promise and sends it in a message (or in the body of resolve-to-data), the act of serializing the as-yet-unseen Promise object will trigger liveslots to call exportPromise. This will use .then to add our thenResolve and thenReject callbacks, which will do some flavor of syscall.resolve when invoked, to inform the kernel that the vat has resolved the promise (instructing the kernel to notify any vats which imported/subscribed to the promise, as well as deliver any messages queued to the promise in the interim). Liveslots also adds the Promise to the valToSlot WeakMap, and will add a WeakRef for the Promise to slotToVal, just like it would for Remotables. Finally, we'll arrange for the Promise to be added to the exports Set, just like Remotables.

It's likely that user code will forget about the Promise immediately after sending it, only retaining access to the resolve and/or reject functions for later. As we've seen above, in practice this means the Promise will be kept alive as well, but we don't want to rely upon that. I'm pretty sure that the then callbacks would be kept alive, which means even if the Promise goes away, the kernel will still be notified when it gets resolved. However, if resolve/reject doesn't keep the Promise alive, and we didn't add the Promise to exports, then we could get into a situation where the Promise was GCed, removing it from slotToVal, which means if any other vat re-delivered the Promise into this vat, it would be treated as a new Promise. That's not ideal, although I can't immediately think of what problems it might cause.

When will exported Promises be dropped?

On the importing side, we must hang on to the Promise (importedPromisesByPromiseID must be a strong ref) until it is resolved, because we can't prove that the user code didn't call then and attach a callback. Our experiments show that resolve/reject keeps the Promise alive, and the Promise keeps the .then callbacks alive, but this isn't guaranteed by the spec. If we were to delete the importedPromisesByPromiseID record when the Promise was GCed, we might fail to resolve the promise when the kernel instructed us to, and userspace's callbacks might never get run.

So, barring some advances in the JS Promise specification (or an importing vat which runs something other than JS), vats have the ability to do a local drop of their imported promise vref, but they'll never use it.

As a result, although we'll probably implement the ability for vats to be told that the kernel no longer (locally) cares about their promise export, we don't expect it to be used. If implemented, this would simply delete the vref from the exports set. Later, if nothing else in the vat is keeping a strong reference to it, the Promise will be collected as usual, and eventually removed from slotToVal and valToSlot.

Instead, we expect to retire Promises if/when they are resolved. After syscall.resolve (which is being refactored to allow multiple Promises to be resolved in a single batch: #2065) has been serialized and delivered, liveslots will delete the resolved vpid values from the slotToVal table, and will delete the Promise itself from valToSlot. We must remove it from exports at the same time. With that accomplished:

  • 1: both vat and kernel can formally delete the c-list entry for the Promise, which might enable the kernel to formally delete the promise record itself (potentially freeing other objects, as usual)
  • 2: if the (now resolved) Promise is ever re-exported, it will be given a new short-lived vpid, a new syscall.resolve will be sent to re-inform the kernel of its resolution, and then the new vpid will be retired too
  • 3: if user code forgot about the Promise, and the code which invoked resolve or reject goes out of scope (dropping resolve/reject), then there should be no strong references left (we deleted the one from exports). There won't even be a WeakRef left, because we delete the slotToVal entry. So the Promise should be GCed shortly, and nothing in the vat will remember either the vref or the Promise (which hits our memory-usage goal).
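A sketch of that retirement step, assuming invented table names (the real code batches multiple resolutions, per #2065):

```javascript
// After syscall.resolve has been delivered, scrub every trace of the
// resolved vpids from the vat-side tables.
function retireVpids(vpids, { slotToVal, valToSlot, exports: exportsSet }) {
  for (const vpid of vpids) {
    const wr = slotToVal.get(vpid);
    const p = wr && wr.deref();
    if (p !== undefined) {
      valToSlot.delete(p);
      exportsSet.delete(p); // drop liveslots' last strong reference
    }
    slotToVal.delete(vpid); // frees the WeakRef itself
  }
}
```

Once this runs, nothing in the vat remembers either the vref or the Promise, which is the memory-usage goal described in point 3.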

@warner
Copy link
Member Author

warner commented Dec 8, 2020

Task List

  • add the dead set (perhaps with a better name)
  • change slotToVal to use WeakRefs
  • add exports Set, change convertValToSlot to add Remotables/Promises to it
  • export/import should remove the vref from the dead set too
  • add Finalizer to imports and exports, giving it the vref string as the associated data. The callback function should examine slotToVal for a live WeakRef, and otherwise add the vref to the dead set. Dead weakrefs should be deleted.
  • add a notification pathway into the kernel that is safe to use at arbitrary times, which queues the vat-id that might have something in the dead set
  • add a phase to the kernel cycle that queries all queued vat-ids for their dead set (introduce a dispatch-like pathway from kernel to vat, which is not recorded in the transcript)
  • add the two refcount fields to kernel object/promise table entries
  • add the local-in-use field to the c-list entries
  • implement the function (described above) to update the c-list entry of the importing vat which has just locally dropped the import
  • implement the kernel table GC sweep algorithm
  • add the kernel-to-vat delivery (non-transcript) that informs the vat of a locally-dropped export
  • probably more

@dtribble
Copy link
Member

dtribble commented Dec 8, 2020

A naming issue: Please use "revoked objects" rather than "cancelled objects". In most systems, "cancel" generally means "stop making it", with a focus of "asynchronously stop consuming resources". Whereas "revoke" means "make it unavailable, preferably synchronously or in order". As a result of canceling, things might get revoked, but that's just one of the actions that could happen as a result of cancelation.

@warner
Copy link
Member Author

warner commented Dec 8, 2020

how exports are dropped

I think dropLocalExport(vref) and dropFormalExport(vref) do the same thing, and the only difference is that the local form doesn't get added to the transcript. Both will:

  • use slotToVal.get(vref).deref() to get the object (Remotable) being dropped
  • remove the Remotable from the valToSlot WeakMap
  • remove the Remotable from the exports Set
    • this derefs the Remotable, allowing it to be collected if the vat code isn't retaining its own reference
  • remove the vref from the slotToVal Map
    • this derefs the WeakRef, allowing it to be collected

At some point in the future, the Finalizer will fire. The Finalizer callback should do the following:

  • slotToVal.get(vref): if this is undefined, the object was already deleted, so return without additional effort
    • if this yields the WeakRef, query it with wr.deref()
      • if this returns truthy, the object was reborn, and is not dead, so return without additional effort
      • if this returns undefined, the object is dead, and we are the first to notice. Add the vref to the dead set, delete the vref from slotToVal (freeing the WeakRef), and notify the kernel
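That finalizer logic can be sketched as follows. To keep the example deterministic, the WeakRef is passed in as a plain object with a deref method rather than relying on real engine GC timing; deadSet and notifyKernel are illustrative names:

```javascript
// Build the finalizer callback over the vat's tables; it receives the
// vref string that was registered as the finalizer's associated data.
function makeFinalizerCallback({ slotToVal, deadSet, notifyKernel }) {
  return function finalized(vref) {
    const wr = slotToVal.get(vref);
    if (wr === undefined) {
      return; // already deleted (e.g. proactively dropped): nothing to do
    }
    if (wr.deref() !== undefined) {
      return; // reborn: a new object now owns this vref, it is not dead
    }
    deadSet.add(vref); // we are the first to notice the death
    slotToVal.delete(vref); // free the dead WeakRef
    notifyKernel();
  };
}
```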

We use the same WeakRef and Finalizer setup for everything in slotToVal/valToSlot, which means both exported Remotables and imported Presences. The only difference within the vat is that Remotables are also referenced by the exports Set, while Presences are not. This keeps the Remotable alive until a dropLocalExport or dropFormalExport deletes it from exports.

(Another option is to keep a handle for the Finalizer around, and have the delete-from-everything code use it to proactively unregister the Finalizer, rather than ignoring the finalizer when its vref is already gone, but that'd be marginally more complicated)

@warner
Copy link
Member Author

warner commented Dec 10, 2020

At yesterday's meeting, @dtribble reminded me that exported Remotables must keep the same vref independent of local GC behavior. We'd worked that out a few weeks ago, but it dropped out of my most recent writeup. To maintain that property:

  • When informing a vat that an export is no longer needed, the kernel clears the local-in-use flag of the c-list entry after translating the dereferenced kref into the vat's vref. If the kernel is calling dropFormalExport instead of dropLocalExport, it deletes the c-list entry entirely.
  • The only thing dropLocalExport/dropFormalExport does is to remove the Remotable/Promise from the exports set; it should leave slotToVal and valToSlot untouched
    • (and the only difference between dropLocalExport and dropFormalExport is that the latter goes into the transcript)
  • If/when the vat code drops the Remotable itself, eventually the finalizer will run. Its job is to delete the slotToVal entry (cleaning up the now-dead WeakRef) and notify the kernel.
    • The exported Remotable's old vref (which will always be in the form o+NN rather than the o-NN used for imports) goes into the Dead Set, and the kernel eventually retrieves it (along with dead imports).
    • The kernel checks each dead vref against the c-list. If the local-in-use flag was already clear, then this is not a newly-dead object, and we don't need to do any GC processing on it.

So vats treat Remotables and Presences the same, except that Remotables go into the exports Set, and Presences don't.

If the vat code drops the Remotable before the kernel calls dropLocalExport, the Remotable will be unreachable from the vat code, but still weakly reachable through slotToVal. However, nothing in the kernel will ever mention it again, because that would require something in the kernel or other c-lists to still have a local reference, and dropLocalExport isn't called until the last of those go away. At that point, there will be no strong references left, so eventually the Remotable will transition into "collected" (making it unreachable even through the now-dead WeakRef), and then into "finalized" (causing the finalizer to run, which frees the WeakRef itself).

If instead the vat code retains a reference to the Remotable and re-exports it after being told to drop the export, the vat will continue to use the same vref as before. This retains a one-to-one relationship between Remotable and vref, regardless of GC behavior. If the export was dropped through dropLocalExport, the c-list entry will still be present (merely the "in local use" flag will be false), so the translation will re-use the old kref too (and set the flag back to true). Only if the export was dropped through dropFormalExport (i.e. the client vat died) will a new kref be allocated. The difference between "use old kref" and "allocate new kref" depends upon a formal drop, which is not a consequence of GC; this is important because the krefs in use are part of the formal consensus state.
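A toy model of that c-list behavior (structure and names invented; the real tables live in the kernel database): a local drop clears the flag but preserves the vref-to-kref pairing, so a re-export reuses the same kref, while a formal drop deletes the entry and forces a fresh kref.

```javascript
function makeClist() {
  const entries = new Map(); // vref -> { kref, inUse }
  let nextKref = 0;
  return {
    exportVref(vref) {
      let e = entries.get(vref);
      if (!e) {
        e = { kref: `ko${nextKref += 1}`, inUse: false };
        entries.set(vref, e);
      }
      e.inUse = true;
      return e.kref; // stable across local drops
    },
    dropLocalExport(vref) {
      entries.get(vref).inUse = false; // entry survives, flag cleared
    },
    dropFormalExport(vref) {
      entries.delete(vref); // next export allocates a fresh kref
    },
  };
}
```

Because only the formal drop changes which kref is in use, and formal drops are not a consequence of GC, the consensus-visible krefs stay GC-independent.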

@warner
Copy link
Member Author

warner commented Dec 15, 2020

Sufficiently Deterministic GC

We don't want to overconstrain validators by requiring them to run an exact binary image (and diversity of validators is critical for security anyways, especially against supply-chain attacks, so we very much want each validator to compile their own code from verified upstream sources). But JS is sufficiently underspecified that we need some constraint to make sure that adversarial contract code cannot cause a consensus fault. E.g. we require that all validators run a JS engine whose sort() behavior is observably indistinguishable. We expect this means we'll pick some major version of XS (for which we expect all minor versions to have the same observable behavior), and require that all validators match. In practice, this probably means all validators will compile their own copy of XS from within that major-version range.

Given that existing constraint, we think we can rely upon GC to behave the same across all validators, especially if we slightly weaken our requirements. The idea @dtribble and @erights and I talked through today is:

  • we insist that the engine give us (i.e. the dispatch unit) a way to explicitly perform a precise+complete GC pass
    • Node.js has an --expose-gc flag which adds gc() to the start compartment. In combination with several setImmediate/setTimeout calls to allow the finalizers on the IO queue to run, this might be sufficient, although we don't know what sort of guarantees Node makes about completeness
    • in XS, we expect a C function that accepts the XSMachine structure to trigger this
    • by the time this GC pass completes, all finalizers must be run, and all WeakRefs to unreachable objects must .deref() === undefined
    • we require that the set of now-dereferenced objects be the same among all validators, however the exact order in which finalizers run does not need to be the same
  • we define a consistent point in the kernel cycle to trigger GC, perhaps at the end of each block, on each vat which was touched during that block
  • we use the same bringOutYourDead method as above to collect the newly-unreferenced vref imports, however now the results are formal, which means bringOutYourDead can be a regular transcript-tracked dispatch
  • we remove the distinction between "formal" and "local" GC: all references are formal
    • this lets us remove the local refcount, and the "local-in-use" flag
    • dispatch.dropExport is now formal
      • liveslots reacts to dropExport by deleting both directions of the valToSlot/slotToVal tables
      • if/when the vat re-exports the same Remotable, it will get a new vref, and thus a new kref

By making an explicit gc() call, we hope to make it easier for engines to maintain the same observable behavior: they can still have different heuristics as to when GC happens naturally (in response to memory pressure, etc), but we force everything to be collected just before the one point when we sense the results (via bringOutYourDead). Finalizers might run early, but we don't observe their consequences until bringOutYourDead.

We might require that finalizers don't run until the explicit gc() call; I think this needs more analysis. Our finalizer callback certainly shouldn't notify the kernel: it should only update the dead set. But I need to think about what happens if a vref is re-imported after its weakref is dead, and whether that would create observably different behavior.
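The sequencing might be sketched like this. gcAndFinalize, bringOutYourDead, and the vat shape are assumptions rather than the real API, and engineGc degrades to a no-op when the engine exposes no explicit collection hook (e.g. Node.js without --expose-gc):

```javascript
const engineGc = typeof globalThis.gc === 'function' ? globalThis.gc : () => {};

// Force a precise+complete collection: let pending finalizers run,
// collect, then let newly-queued finalizers run before we observe results.
async function gcAndFinalize() {
  await new Promise(resolve => setImmediate(resolve));
  engineGc();
  await new Promise(resolve => setImmediate(resolve));
  engineGc();
  await new Promise(resolve => setImmediate(resolve));
}

// End-of-block pass over every vat touched during the block. Because the
// GC happens at a consistent point, every validator observes the same dead
// set, and bringOutYourDead can be an ordinary transcript-tracked dispatch.
async function endOfBlock(touchedVats) {
  for (const vat of touchedVats) {
    await gcAndFinalize();
    const deadVrefs = vat.dispatch.bringOutYourDead();
    vat.processDeadVrefs(deadVrefs);
  }
}
```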

@warner
Copy link
Member Author

warner commented Jun 21, 2021

We landed the changes for #3106 and #2615, and we're mostly not trying to pursue a non-deterministic form.

I'm going to leave this open until I can file a ticket with my notes about the remaining non-deterministic handling that we need to do: when we take a transcript for a vat that was run under xsnap and replay it (for debugging) in a local worker, its GC behaves differently.

@erights
Copy link
Member

erights commented Jun 22, 2021

Do you expect the gc differences to be observable to unprivileged user code? Is there any way in which gc on is observably different from no gc, to unprivileged user code?
