-
Notifications
You must be signed in to change notification settings - Fork 226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
per-run metering limits to terminate contracts causing fast cross-vat infinite loops #7938
Comments
Oh, note that we can use a regular meter (instead of an |
I was thinking of using regular meters in the first place. For the |
I looked at the mainnet chain today, and found that all contracts vats are already non-critical. @dckc pointed out that we don't even give
Also, we're currently only launching new contracts via a CORE_EVAL proposal, which evaluates code snippets in a scope that includes access to Zoe (and optionally We still need the Zoe API to be able to pass through a Meter, and we should probably hold on to the new Meter somehow so we can do something with it later (once we stop auto-refilling them). |
I looked at the mainnet slogfiles to find the largest deliveries (computron count), to see if we'll have any problems with a 100Mc per-delivery limit. I found 55 deliveries with >50Mc, 45 with >90Mc, and 14 with >100Mc (the largest was 128Mc). If any metered contract vat were to spend that much in the new world this ticket creates, that vat would be terminated. Most of these happened during the bootstrap block, but the exceptions are a The largest, 128_182_169c, was the So, we need to make sure the per-delivery limit is larger than 128Mc:
Our per-block limit (the |
My plan is to augment To use this argument, vat-vat-admin must be upgraded, since the old version (in swingset 0.32.2 and earlier) does not pass it through to the vat. Providing a |
I'm confused, why do we need a new argument? Wouldn't the |
Oh, interesting. That would certainly be a smaller change: store (we need to know There's certainly no utility in setting The original idea around Meters was that they'd suffice for weeks of normal operation, and contract instance owners would get notified if But.. if So whatever contract-owner -facing controls we're going to put on Meters, we aren't going to let them pick their own If we were being strict, the Which means we have to set If that were also the initial value of Effectively, : async function createMeter(remaining, threshold, deliveryLimit) {
return E(vatAdminService).createMeter(remaining, threshold, deliveryLimit);
} would be implemented as: async function createMeter(remaining, threshold, deliveryLimit) {
assert(remaining > deliveryLimit);
const meter = await E(vatAdminService).createMeter(remaining, threshold);
await E(meter).addRemaining(remaining - deliveryLimit);
return meter;
} In the long run, contract owners will not be creating their own Meters, they'll have a meter imposed upon them by the contract-launching service, so this pattern could be hidden by that service. I guess it comes down to this: we could avoid an API change (and a consequential need to upgrade vat-vat-admin) by making the use pattern slightly weird. The simplest way to call it, calling just Not the worst tradeoff to make, but probably one to be deliberate about. |
To be sure I follow, today the Yes in a perfect world, this limit would be dynamic, so that we could have a high limit for the initial delivery (and maybe for some deliveries like BOYD), and a more reasonable limit for normal deliveries, as you say likely of the same magnitude as the per block limit. However I don't believe we need this yet. As far as I can tell, the per delivery limit can remain predefined (but recorded in the vat options), and we don't need to introduce any changes that would require a vat vatAdmin upgrade for this feature. The only vat that should require an upgrade is Zoe to accept a meter when instantiating a vat. |
Correct, in current trunk, we either tell The biggest win of a dynamic limit would be to set it equal to The main reason I wanted the Having the delivery limit from from the So the core options are:
|
My vote is for 1) as I believe it satisfies the requirements for this issue and does not prevent switching to something like 3) in the future. |
@mhofman : https://github.com/Agoric/agoric-sdk/tree/7938-reset-all-meters has a start on |
From my pismo replay slog (using the version of XS in pismoC + patch to treat weakref strong), I have the following delivery results:
So clearly 200M computrons is enough for a deliveryLimit. |
Add a feature which allows the host application to reset all non-unlimited Meters to a new absolute value. refs #7938
I think we've settled on a plan:
This means:
As @mhofman recommended, we can avoid tying Meters and |
We really need to think whether that is sufficient. The terminated contract may have generated assets of its own, which would no longer be reachable and this become worthless. We likely need to implement #3528 to be safe against all runaway prevention actions. |
What is the Problem Being Solved?
We need to protect the chain against runaway contracts. For now, we've broken the problem into three categories:
timer.poll()
), but then the host reaction triggers the same vat callTo address the second category ("fast cross-vat loops"), we want to kill the misbehaving vat when it's used too much time.
We originally designed a
Meter
scheme where each vat gets a limited number of computrons, is terminated if the meter underflows, and relied upon something (Zoe, acting on behalf of a contract owner) to refill the meter periodically, requiring payment in a fee token to limit the consumption it enables. We implemented all of this, but backed away from it because the "contracts as small businesses" model didn't feel mature enough, or sufficiently easy to use.Today, @mhofman had an idea: as a short-term fix, rather than doing pay-for-usage, or even a basic computrons-per-time rate limit, we can use a computrons-per-"run" limit. Contract owners wouldn't need to do anything special, but contracts which don't finish their work within a few blocks would be terminated.
Description of the Design
First, a description of our current tools:
The Swingset
vatAdminService.createVat()
API accepts ameter
option, taking objects created byvatAdminService.createMeter()
. Each meter has a current capacity, which can be increased or decreased by theMeter
holder. Any vat associated with a Meter (there can be multiple) will cause the capacity to be reduced as it executes: at the end of the delivery, however many computrons were consumed by thexsnap
worker will be deducted from the Meter. If that brings the capacity below zero, the vat is terminated and the last delivery is unwound. The parent vat will observe thedone
promise get rejected with a "vat terminated" message. If we had meter notifiers enabled (we ripped them out during the durablization effort), the subscriber would be told that the meter was exhausted.Swingset exposes
controller.run(runPolicy)
to the host application. TherunPolicy
basically limits execution to a maximum number of computrons: the kernel will perform deliveries until either the run-queue is empty, or the total computrons executed exceeds the policy limit. We (cosmic-swingset, in a governable parameter) picks a computron limit intended to reflect maybe 5 seconds of execution.For every block, cosmic-swingset performs the following sequence:
controller.run()
, to try and finish any "old business" still on the run-queue from the previous blocklabel A
, referenced below)controller.run()
timer.poll()
(wake up timers), which might push work onto the run-queue, and then do anothercontroller.run()
controller.run()
Our proposal is to:
controller.resetAllMeters(computrons)
API to the kernel, which resets all meters to the given limitcontroller.resetAllMeters(10 * perBlockComputronLimit)
at the point markedlabel A
, just after draining the run-queue of old business, just before pulling any new work from the high-priority action queueUnlimitedMeter
, but the capacity will be reset at the next block, so it doesn't matter too much. However, see below about changingUnlimitedMeter
values after creationxsnap.js
, but it is set per-worker, not per-delivery, making it difficult to change for pre-existing vats)With this in place, normal execution won't notice anything different: they'll reduce their meter value a little bit during execution, their meter gets reset a block later. Zoe doesn't need to do anything with the Meter after creation (zoe is not responsible for refilling it), nor does Zoe need a notification that it has fallen below a threshold (which is good because we ripped them out).
But imagine a contract that's engaged in a fast cross-vat loop with Zoe or vat-timer. For concreteness, let's say that the per-block limit is 1M computrons, and assume both vat-zoe and the contract vat are spending 100k computrons per cycle.
On the first block that involves this contract, we'll see 5 iterations of this loop before the
controller.run()
will hit the runPolicy's 1M per-block computron limit, leaving some messages on the run-queue, and ending the block. The contract vat will have consumed 500k computrons, so it's meter will have 10M-500k = 9.5M left. vat-zoe will have consumed the same, but it isn't metered, so there's nothing to deduct.Now, on the next block, we'll start with the "try to finish old business" phase, and it will exhaust the runPolicy too, finishing the block just like before, leaving the contract vat with 9.0Mc left on its meter. This will go on for 20-ish blocks, at which point the contract vat's Meter will be exhausted. The contract vat will be terminated, zoe will do it's normal cleanup (returning escrowed funds). The run-queue can drain, leaving some per-block budget left, so cosmic-swingset can pull the next piece of new work from the action queue, and useful activity resumes.
If we call
resetAllMeters
withA
times the per-block limit, and each cycle of the loop sees the contract vat use a fractionB
of the total computrons involved (e.g.B = 0.5
for a loop that involves only vat-timer and the contract vat, and where both vats do equal amounts of work), then this will kill the contract vat afterA / B
blocks. On the other side of the scale, a contract vat which just needs a lot of time to finish its job will get at leastA
blocks worth of CPU without the threat of termination.When we move to a more sophisticated metering scheme, we'll remove the
resetAllMeters
call, and find some other way to provide computrons to the contracts that were relying upon that source.rejected plans
Some approaches that came up but were too complicated:
Security Considerations
This increases the consumption ability of new contracts, since previously all contract vats were unmetered, so we did not enforce a per-crank delivery limit. However it also removes the critical-vat flag from new contracts vats, removing their ability to panic the kernel (and thus halt the chain).
The design described above assumes that we're no longer creation any critical/unmetered contract vats. If we want to retain that ability, we'll need some option passed into
zoe.startInstance()
, and some authority mechanism to gate access to it (similar to thecriticalVatFlag
used to enable the creation of a vat that will panic the kernel if it terminates).Deployment Considerations
This requires changes to:
controller.resetAllMeters()
API)resetAllMeters()
)We can deploy the kernel and cosmic-swingset changes at the same time, in a chain-software upgrade (governance proposal, vote, chain halt, software upgrade, chain resumption).
To deploy the Zoe change, we'll need a core-eval proposal which uses bootstrap's retained vat-zoe
vatAdminFacet
to perform a.upgrade()
to a new Zoe bundle (which must be pre-installed on the chain with avalidateAndInstallBundle
txn). We're still figuring out how to do this best: we might arrange for the chain software upgrade to inject this core-eval action as the first event after the upgrade. Note that if the run-queue was not empty before the chain upgrade, this zoe upgrade may not happen right away (especially if the chain is already in a fast-loop situation, which would require special surgery to deal with).There are no problems if the kernel/cosmic-swingset change happens first, and the zoe change doesn't happen for a while. In that situation, cosmic-swingset would reset all meters on every block, but there wouldn't be any meters to change, so nothing special would happen.
If somehow the zoe upgrade happens first, then we'd have metered contract vats with nothing to refill their meters. These vats would run for a while, until whatever initial capacity was used up, then they would terminate. It may be best to have zoe create
UnlimitedMeter
s, instead of meters with a fixed capacity. These contracts vats would remain unlimited until the chain-upgrade happened, after which the "unlimited" value would be reduced to the 10*block value (also, if we do that, then we don't have to hard-code a particular value into Zoe, nor add a mechanism to change it later).Unfortunately
UnlimitedMeter
s are not currently prepared to have their capacity changed. To fix that, we need to upgradevat-vat-admin
to change the implementation. That will require acontroller.upgradeStaticVat()
, which requires new code incosmic-swingset
(because core-eval, running inside a vat, cannot reach thecontroller
object that controls the kernel itself).So the total upgrade sequence might be:
controller.upgradeStaticVat()
, to change vat-vat-adminE(vatInfo['zoe'].adminFacet).upgrade()
with the just-installed bundleScaling Considerations
resetAllMeters
would iterate through allkvStore
keys that matchm\d+
, using a prefix/range-basedgetNextKey
loop. In the long run, we would probably want a faster way to find the appropriate subset of meters to change. But, we'll probably remove this feature (in favor of something more sophisticated) before that point.Test Plan
Kernel unit tests for
controller.resetAllMeters()
.I don't know how to test the upgrade handlers.
We should add a new test (not sure where it should live) of trunk code surviving a contract vat that goes into a fast cross-vat loop, e.g. after the third offer it executes.
The text was updated successfully, but these errors were encountered: