This repository has been archived by the owner on Dec 7, 2024. It is now read-only.

Concern with un-decoded bits / alternative for SIMD #11

Open
lukewagner opened this issue Mar 22, 2022 · 14 comments

Comments

@lukewagner
Member

I'd like to lay out a concern regarding feature-detection along with a way to address this concern while still addressing the needs of SIMD (which as @tlively was saying in his last presentation, seems to be our only short-term need for feature-detection).

The concern is that, despite our best intentions in the short-term, if we allow wasm modules to validate with completely un-decoded bits, then over the long-term (as wasm gets more toolchains and engines and business contexts), we'll end up with a lot of wasm modules containing non-standard bits. While it is certainly possible today for particular toolchains and engines to agree to produce/consume non-standard/pre-standard bits (which is actually an essential part of the process of standardization), there is a natural check-and-balance in that these non-standard/pre-standard bits will only run on those particular engines and fail to validate on the rest. If it becomes possible to produce modules with non-standard bits that validate on all wasm engines, then it becomes much easier to push out non-standard bits and circumvent the standardization process in general. As with most vague concerns, maybe there comes a time when this is a risk we want to take, but I think it's a good idea to hold off on crossing this Rubicon for as long as possible.

If we zoom in on the SIMD use case, I think a reasonable alternative solution makes sense (slightly different from what I've suggested before), based on the observation that the engineering effort and engine size required to decode, but not validate or compile, SIMD (or any other future wasm feature, for that matter) is minuscule compared to the overall effort required to implement MVP wasm 1.0, and thus it's not worth trying to optimize away SIMD decoding on engines that don't support SIMD (classic Amdahl's Law).

Based on this observation, I think we can derive a reasonable compromise:

  • For any feature that needs to be permanently un-implemented on some engines or hardware/configurations, we specify a feature flag, testable via some instruction that returns whether that feature is "available" (as an i32 value).
  • When a feature is unavailable, the static and dynamic semantics treat all instructions of that feature as-if they were immediately preceded by an unreachable instruction, and thus they are decoded as dead code.
    • I'm assuming here we adopt Conrad's relaxed-dead-code-validation proposal so that this dead code is really-truly only decoded, not validated-in-fun-ways as it is today.
    • If a proposal introduces a new type, then because all the instructions that can observe the values of that type are unreachable, the type doesn't actually need to be implemented with runtime values. E.g., engines with SIMD unavailable wouldn't need to implement v128 with an actual 128-bit value -- they could use nothing or a dummy.
  • We adopt a general expectation in the CG that, once a wasm feature is standardized, all engines either implement it fully or implement it as unavailable -- but either way the engine doesn't permanently leave the feature as failing to decode.
    • Noting that an engine could initially implement a feature as unavailable and then later make it available as the feature was prioritized or the landscape changed.

Some other nice technical benefits from this approach are:

  • This seems relatively easy to specify and implement.
  • This doesn't add any complexity when dealing with code offsets in custom sections (WebAssembly DWARF, names, ...).
  • This doesn't introduce a general new mechanism that needs to be proposed as its own separate feature -- the next SIMD proposal could just do this for SIMD and avoid the scope creep of a more-general feature.
  • This establishes a path forward for adding additional SIMD instructions outside the current maximally-portable intersection: new instructions could be grouped into arbitrarily-fine-grained features, allowing the toolchain to simply branch in normal code.

With respect to the other major concepts of "version" and "profile":

  • There may still be reasons for defining different profiles (e.g., a "deterministic" profile still makes a lot of sense for unrelated reasons), but I think we'd try to minimize this and prefer defining unavailable features (all in the same default profile).
  • A monolithic, monotonically-increasing version would make sense to capture the set of features (always- and potentially-available) that a toolchain could expect to emit without validation error. Even when assuming version X which contains feature Y, if Y is specified to be potentially-unavailable, the toolchain will need to branch on Y's availability (by default).

At least, that's the best I've been able to come up with; happy to discuss!

@lukewagner lukewagner changed the title Concern with un-decoded bits and alternative for SIMD Concern with un-decoded bits / alternative for SIMD Mar 22, 2022
@ajklein

ajklein commented Mar 22, 2022

Thanks for laying this out in detail. While I understand the concern, I'm skeptical that any approach to this problem that doesn't involve validating un-decoded bits will be sufficiently useful to meet the needs of today's SIMD developers.

Take one important use-case: feature detection based on differences between versions of a single engine. Because the approach described here will only allow high-fidelity feature detection of newly-standardized features on engines which have knowledge of those features, it will never be useful in such a case (since the old versions of the engine couldn't possibly know what opcodes will be standardized in the future), and thus developers will continue to make use of host-provided mechanisms like the JS-based techniques used today. But it's the pain of doing that (building, maintaining, and serving multiple binaries; maintaining & running detection code; etc.) that I see as the central target for @tlively's work here.
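For context, the host-based technique referenced above usually amounts to probing the engine with a tiny module that uses an instruction from the feature in question and checking whether it validates (this is the approach libraries such as wasm-feature-detect take). A minimal sketch, with a hand-assembled probe module whose only function body is `v128.const` followed by `drop`:

```javascript
// Probe module bytes: "\0asm" header, a ()->() function type, one function,
// and a body containing `v128.const 0` + `drop`. On an engine without SIMD,
// the 0xfd opcode prefix is unknown and validation fails.
const simdProbe = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // "\0asm", version 1
  0x01, 0x04, 0x01, 0x60, 0x00, 0x00,             // type section: () -> ()
  0x03, 0x02, 0x01, 0x00,                         // function section: 1 func of type 0
  0x0a, 0x17, 0x01, 0x15, 0x00,                   // code section: 1 body, no locals
  0xfd, 0x0c,                                     // v128.const ...
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // ...16 immediate bytes
  0x1a,                                           // drop
  0x0b,                                           // end
]);

// true if this engine decodes and validates the SIMD instruction, else false.
const simdSupported = WebAssembly.validate(simdProbe);
```

An application then uses the result to pick between a SIMD and a scalar build of its module, which is exactly the multiple-binary burden described above.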

The same reasoning applies for feature detection across different engines, since it's unlikely that different engines with different release schedules & priorities will happen to align their releases of a fully-functional version of a new proposal with a "dummy" version in other engines.

Of course none of what I've written above implies that Wasm must add some mechanism for validating un-decoded bits. But I think these use-cases should be met by anything we do add. Or to put the requirement another way, if we're going to add a feature detection mechanism to Wasm for use-cases like that of SIMD additions over time, it must be powerful enough to supersede most (possibly all) of the uses of host-based feature detection (at least once the mechanism itself is standardized and implemented across the ecosystem).

@lukewagner
Member Author

I guess there are two distinct-but-related use cases here:

  1. shipping features in a varied landscape where not all the features can be implemented in all the places that a single wasm binary wants to run
  2. shipping new features quickly before all engines have implemented them

and what I'm proposing in this issue only addresses (1), not (2). (1) seems like a problem that will only increase over time as wasm gets deployed more widely, so I expect it's something we'll eventually need to have an answer for. In contrast, (2) seems like a problem that will diminish over time, as there is a (hopefully) finite set of SIMD features with finite implementation windows before we start reaching diminishing returns from adding new SIMD features. The specific coefficients matter a lot here, of course, but I think this is something to weigh against the permanence of the above concerns.

Also, practically speaking: how bad would it be in a Web context to deal with lagging browser implementations using a fetch()→transform→instantiateStreaming() polyfill, where the transform does the desired bytecode manipulation in terms of stream splicing? It seems like this should perform pretty well.
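The polyfill flow being suggested can be sketched as follows. Note `transformBytecode` is a hypothetical placeholder: a real transform would splice the byte stream, e.g. rewriting unsupported SIMD opcodes into scalar sequences, and a production version would operate on streams rather than buffering the whole module.

```javascript
// Hypothetical transform hook -- a real implementation would parse the code
// section and splice in replacement bytecode; here it is the identity.
function transformBytecode(bytes) {
  return bytes;
}

// Fetch the module, transform its bytes, and re-wrap them in a Response so
// instantiateStreaming sees the application/wasm MIME type it requires.
async function instantiateWithTransform(url, imports = {}) {
  const response = await fetch(url);
  const original = new Uint8Array(await response.arrayBuffer());
  const transformed = transformBytecode(original);
  const wrapped = new Response(transformed, {
    headers: { "Content-Type": "application/wasm" },
  });
  return WebAssembly.instantiateStreaming(wrapped, imports);
}
```

A streaming variant would pipe the fetched body through a `TransformStream` instead of buffering, which is what makes the "stream splicing" framing above attractive.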

@titzer

titzer commented Mar 22, 2022

I recently implemented decoding and validation of SIMD in Wizard (but not yet execution). I think the difference between decoding the actual instructions and validating their types is actually very small here (e.g., there are no instructions with "interesting signatures", allowing them all to be listed in a table: https://github.com/titzer/wizard-engine/blob/master/src/engine/Opcodes.v3#L287). Keep in mind that decoding requires validating immediates like alignment and lane indexes, and it is incrementally not much more work to check a signature from the table (https://github.com/titzer/wizard-engine/blob/master/src/engine/CodeValidator.v3#L779). An engine has to decode the value types and constant initializers as well.

I think we wouldn't be saving an engine much actual work by omitting type validation while still requiring decoding of the v128 type and its instructions, because there are so many places they can occur (tables, globals, signatures, locals, and function bodies).
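The table-driven validation described above can be sketched roughly as follows. The opcode names and signatures here are illustrative stand-ins, not the real SIMD encodings; the point is that the incremental cost over pure decoding is a single table lookup plus a few stack operations.

```javascript
// Illustrative signature table: opcode -> parameter/result types.
// Since SIMD instructions have no "interesting signatures", one flat
// table like this covers them all.
const SIG = {
  add:   { params: ["v128", "v128"], result: "v128" },
  splat: { params: ["i32"],          result: "v128" },
};

// Pop the declared parameters off the type stack and push the result,
// throwing on a mismatch -- this is the extra work beyond decoding.
function checkOp(name, stack) {
  const sig = SIG[name];
  if (!sig) throw new Error(`unknown opcode: ${name}`);
  for (const want of [...sig.params].reverse()) {
    if (stack.pop() !== want) throw new Error(`type mismatch in ${name}`);
  }
  stack.push(sig.result);
  return stack;
}
```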

For SIMD as it is, I see four possible levels of engine support:

  1. An engine does not recognize it at all; everything is a decode failure.
  2. An engine decodes and validates SIMD fully, with types, but traps on executing any SIMD instruction.
  3. An engine implements SIMD fully, but it may be emulated with scalars (i.e., not "fast" or "native").
  4. An engine implements SIMD fully, and it uses all available native CPU instructions (i.e., is "fast").

I don't see the option in-between 1. and 2. where an engine decodes but doesn't validate SIMD types.

As for the broader discussion,

I am somewhat tempted to agree in general that we want engines to align on what bits are acceptable and want to avoid "circumventing the standardization process" and pushing out non-standard bits that work (perhaps underhandedly) on other engines by virtue of skipping decoding. I would almost be tempted to advocate that all engines support all the features in their frontends, but this will break down for complex features like GC, which are a lot of work, and very subtle, in the front end. Thus, in the case of GC, I think we'll want engines that just refuse to decode these features, treating them as invalid binary bytes.

@rossberg
Member

@titzer, on WebAssembly/design#1444 you asked that we find a holistic answer to versioning, and I agree! But my impression is that the discussion so far is going the opposite direction, focussing very narrowly on singular use cases and considering ad-hoc solutions to them.

At the same time, I'm also very sympathetic to @lukewagner's original observation that a generic ability to accept unparsed code paths is a risky proposition. It could make the need for feature detection a self-fulfilling prophecy.

IMO, we should look for something that works

  1. across features (e.g., does not consider SIMD a special unicorn),
  2. across platform requirements (e.g., some will want minimal engine footprint),
  3. across language constructs (i.e., not just instructions),
  4. for standardised features only.

For example, GC is an example of a feature that affects many parts of the language, where parsing is relatively easy, but validation is not (and execution might be impossible). So it might indeed be desirable to have something in-between 1 and 2, where engines have to decode it (avoiding Luke's problem) but not to validate it (avoiding unwanted engine complexity). And of course, that would then also work for SIMD.

So abstracting away from SIMD, we can recast Ben's spectrum of "optional support" as follows:

A. Does not decode the feature and rejects it.
B. Decodes the feature but does not fully validate it and traps instead.
C. Decodes and validates the feature accurately but traps nonetheless.
D. Implements the feature, but inefficiently, use discouraged.
E. Implements the feature efficiently.

My assumption would be that some environments or engine makers will probably prefer either A or E for optional features, and we should allow either, uniformly.

To support feature detection, engines would have to be allowed to use one of B or C as well. Feature detection could then be supported in the form of alternative sections, but those are only accepted if all their alternatives decode (B) or validate (C), respectively. (To be clear, B vs C would be a choice in the spec, not by an implementation – and B seems preferable, since it works for more features.)

(I'm not convinced yet that D is a worthwhile option, but of course it can be introduced as well.)

As @ajklein points out, this does not address some of the current SIMD use cases. But I agree with Luke's reply.

@lars-t-hansen

The SIMD case is really two cases: engines that don't support SIMD at all and don't know about v128 ("new types + new instructions"), and engines that know about v128 but may not be aware of all instructions ("new instructions").

Consider the latter case, new instructions. It is not limited to new SIMD instructions, and while it may be "diminishing over time", it is not given that we'll just stop adding instructions at some point. Instructions that have been proposed but not spec'd yet that fall into this simple category include carry propagation instructions, byte swap instructions, memory control instructions, and thread management instructions. We will have acquire-release atomic instructions. Looking to some common instruction sets for things missing from Wasm, we find bitfield insert/extract/clear instructions and bit reverse. Looking to the past, we added some instructions post-MVP that could have benefitted from a simple guard: sign extension, non-trapping float-to-int, and arguably the bulk memory and bulk table instructions (even if there were non-instruction additions in the latter proposal).

Common to a number of the messages above is that they present solutions that do not address this problem, which seems important enough and long-term enough to want to be solved.

The "only new instructions" use case is met simply using the proposed feature_block + feature.supported mechanism or the slightly more powerful function-block-level mechanism. Either way, it is simple to spec and simple to implement. The objection to it has been that it allows for undecoded bits.

The problem with wanting to avoid undecoded bits is that there can be what amounts to undecoded bits in the instruction stream already, at least if it's a matter of an implementation wanting to use private instructions. v128.shuffle takes a constant shuffle pattern that in practice is very sparse and can stand in for any SIMD binary operation (and hence also any unary) without impacting its normal use. Since v128 subsumes f64, f32, i64, and i32, the instruction can also be used to express any unary or binary operation on those types, in a combination with lane insert/extract. All that's needed is a condition to ensure that these instructions are executed only on platforms that understand their secret meaning, but that's trivial.

@lukewagner
Member Author

I don't think v128.shuffle is in the same category as undecoded bits: while sparse, the wasm spec already specifies a meaning to all validating lane immediates, and thus attempting to claim some particular lane tuple for a non-standard feature would result in an incorrect implementation. This would be in the same category as hiding a feature in the bits of sequences of i64.consts: an engine/producer technically can do it already today, but it doesn't seem likely to be used in a widespread manner for this purpose over time in the same way as, e.g., feature_block.

while it may be "diminishing over time" it is not given that we'll just stop adding instructions at some point

I don't assume we'll stop adding features over time; we definitely will. The question is whether the delay between standardization and implementation will become so bad that it becomes necessary to support undecoded bits in the manner of feature_block. Maybe it will, but given the downsides, it seems like we shouldn't do this preemptively but instead have a concrete problem that can't be solved otherwise. Currently, iiuc, SIMD v.0 has been fully standard for less than a year and is already in 2 browsers, and it seems like the JS polyfill technique I mentioned above could be effective at polyfilling the third in the interim (and, even more concretely, be usable immediately, well before feature_block).

In contrast, iiuc, in the two browsers that do currently natively implement SIMD, there are hardware configurations that don't support SIMD so what I'm proposing above could be adopted in the short-term as a solution for these present-day problems in a minimally-invasive manner.

@rossberg
Member

rossberg commented Apr 4, 2022

@lars-t-hansen:

The problem with wanting to avoid undecoded bits is that there can be what amounts to undecoded bits in the instruction stream already, at least if it's a matter of an implementation wanting to use private instructions. v128.shuffle takes a constant shuffle pattern that in practice is very sparse and can stand in for any SIMD binary operation (and hence also any unary) without impacting its normal use. Since v128 subsumes f64, f32, i64, and i32, the instruction can also be used to express any unary or binary operation on those types, in a combination with lane insert/extract. All that's needed is a condition to ensure that these instructions are executed only on platforms that understand their secret meaning, but that's trivial.

Would this allow an implementation to sneak in custom extensions without the code being rejected on conforming engines, though? That possibility is the main worry Luke points out wrt feature_block – it would open the gates to enabling freely shipping code containing "vendor extensions" and all the bads that can follow from that for the future of the standard and its ecosystem.

@lars-t-hansen

@rossberg Unlike, say, i64.const, there are many bit patterns of the shuffle operation that are effectively meaningless; they would not appear in any sane program (not even in optimized code where multiple patterns had been merged). Thus the shuffle can be used to express private unary and binary operations on v128, f64, f32, i64, and i32, and those operations and programs would validate on a system that supports SIMD (but would produce the desired result only on a system that has the private operation). Suppose I have a private unary i64 operation that hashes the value into an i32; I could express it like this:

    i64x2.splat               ;; express the value as simd
    v128.const i64x2 0 0      ;; we need a garbage second operand
    i8x16.shuffle <some pattern that names the desired operation>
    i32x4.extract_lane 0      ;; get the result; the rest is garbage

Whether this is sneaky "enough" is another matter, maybe; I don't see how it can easily encode control flow, new types of block structure, or operate on reference types, but perhaps there are ways. (A suitable shuffle operation that carries what is interpreted as an index into a custom section could act as a reference to private code to be inlined in the compiled program, I guess, but it still feels pretty limited.)

More generally, I feel that the core of the use case under discussion is that "there are going to be instructions that not all engines know what to do with at every point in time". It's already possible to work around that with sniffing and supplying multiple versions of a module, or with dynamic linking; it's just not pleasant to do so. A solution that directly addresses the use case will almost certainly need to allow for some bits to be uninterpreted. If the committee can't come to agreement on that we'll have to live with existing workarounds (or new ones, though they're not likely to be much better than the ones we have already). It's not the end of the world; it just adds friction.

@rossberg
Member

rossberg commented Apr 4, 2022

@lars-t-hansen, yes, but Luke's argument is that it would be bad to allow (or in fact, require) engines to accept any code that they do not understand, because that would create a backdoor for proprietary vendor extensions that could undermine the standard. And that applies to non-standard opcodes and non-standard immediates alike. Maybe I'm misreading your reply, but I'm trying to understand what difference you see between the two. Right now, no conforming engine would accept unspecified shuffle patterns either, so they are not observably different from unspecified opcodes, AFAICS. And feature blocks would of course allow both.

@lars-t-hansen

@rossberg, perhaps we're talking about slightly different things. The shuffle patterns I'm talking about are valid (every lane is less than 32), they're just nonsensical in practical programs. So they would validate, and they have a meaning, but since no useful program would need to express that meaning, they can be assigned a different meaning in an implementation at extremely low risk. The risk would then be further mitigated by gating the use of the "private opcode" on some type of implementation sniffing.

@tlively
Member

tlively commented Apr 4, 2022

I strongly agree with @lars-t-hansen and @ajklein that gracefully handling completely unknown SIMD extensions from the future is the raison d'être of this proposal. If we can't find consensus that this problem is worth solving in the first place, then so be it, but my SIMD users will continue to be frustrated with the relatively poor developer experience we offer for working around this problem.

@rossberg
Member

rossberg commented Apr 4, 2022

@lars-t-hansen, ah, now I see what you mean, thanks! Interesting. But of course, a deliberately non-conforming VM would clearly not be blessed by the standard, unlike the (ab)use of feature blocks for a similar purpose. So the amount of "criminal energy" required to exploit that would be on a whole different level.

@conrad-watt
Contributor

conrad-watt commented Apr 27, 2022

  • When a feature is unavailable, the static and dynamic semantics treat all instructions of that feature as-if they were immediately preceded by an unreachable instruction, and thus they are decoded as dead code.

    • I'm assuming here we adopt Conrad's relaxed-dead-code-validation proposal so that this dead code is really-truly only decoded, not validated-in-fun-ways as it is today.

Just to note, in the latest iteration of that proposal, the idea was that any validation rules not dependent on popping types from the type stack during validation would still be applied in dead code. There was a general sense that we didn't want to relax the type system any more than strictly necessary, and that we wanted to limit how wacky one could get with smuggling invalid instructions/random bytes into dead code.

EDIT: and I think more importantly, there was a concern that we haven't previously exposed any distinction between decode-time and validation-time errors, and the spec isn't currently factored in a way that would make this easy

@rcombs

rcombs commented Aug 3, 2022

Re: @lukewagner's comment earlier:

I guess there are two distinct-but-related use cases here:

  1. shipping features in a varied landscape where not all the features can be implemented in all the places that a single wasm binary wants to run
  2. shipping new features quickly before all engines have implemented them

I'd just like to note that this seems to kind of sidestep a common "in-between" state: shipping new features quickly before dropping support for all versions of all engines that have not implemented them. Many applications wish to support several versions of common browsers and operating systems, and anything targeting embedded use-cases (smart TVs, etc.) will need to support several-year-old Chromium and WebKit at a minimum. Runtime compatibility checks don't just mean "you can start using new features before Safari ships them"; for many, they mean "you can start using new features before making the hard decision to drop support for devices manufactured the year the feature was introduced".
