Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternatives to i31ref wrt compiling parametric polymorphism on uniformly-represented values (OCaml) #100

Closed
sabine opened this issue Jun 27, 2020 · 109 comments

Comments

@sabine
Copy link

sabine commented Jun 27, 2020

As i31ref doesn't seem to be an unanimously-agreed-on (see #53) part of the GC MVP spec, I am very interested in discussing what the concrete alternatives to it are in the context of parametric polymorphism on uniformly-represented values. (I would appreciate if the answer doesn't immediately read as "use your own GC on linear memory".)

To give some (historical) context: Why does OCaml use 31-bit integers in the first place? Generally, it is possible, to have a model of uniform values where every value is "boxed" (i.e. lives in its own, individually allocated, heap block). Then, every value is represented by a pointer to the heap and can be passed in a single register when calling a function. A heap block always consists of a header (for the GC), and a sequence of machine words (values). From an expressiveness standpoint, this is fine. However, when even simple values such as integers are always boxed (i.e. require a memory access to "unbox" them), performance suffers. Design constraints for the representation of unboxed integers were: a) need to be able to pass unboxed integer values in a single register, and b) need a means for the GC to distinguish (when crawling the heap) whether a value in a heap block represents an unboxed integer or a pointer to another heap block, c) being as simple as possible for the sake of maintainability. In OCaml, the compromise between performance and simplicity that was chosen is to unbox integer values by shifting them left by one bit and adding one. Since pointers are always word-aligned, this made it trivial to distinguish unboxed integers from values that live behind heap pointers. While this is not the best-performing solution (because all integer arithmetic has to operate on tagged values), it is a simple one.

Note that there exist compilation targets of OCaml that use 32-bit integer arithmetic, and the OCaml ecosystem largely accounts for that. Having libraries also consider the case where integers have 64-bits seems feasible. Some code will get faster if we can use native 32-bit integer arithmetic.

Ideally, for the sake of simplicity, we would like to emit one type Value to WebAssembly, which represents an OCaml value, which is either:

  • a reference to a heap block
  • an unboxed value that fits into a machine register
  • a reference to some opaque value from the outside world (traditionally, for OCaml, this is C, but in WebAssembly, this could be either an anyref or some address in the linear memory of another module)

A heap block of OCaml traditionally consists of

  • a header word that holds a) a tag that gives some information about the block, b) GC color bits (obsolete with WASM GC), and c) the size in machine words of the block
  • a sequence of machine words, length of it being the size specified in the header.

The most trivial representation (i.e. the one matching most closely the existing one) that I see when I look at the MVP spec is an anyref array that holds both references to other heap blocks and i31ref values. So, from the viewpoint of having to do as little work as possible in order to compile to WebAssembly and keeping the implementation simple, i31ref is certainly looking very attractive for an OCaml-to-WASM compiler MVP.

In #53 (comment), @rossberg summarized:

For polymorphic languages, there are these [heap representations]:

  1. Pointer tagging, unboxing small scalars
  2. Type passing, unboxing native scalars, runtime type dispatch
  3. Type passing, unboxing native scalars, runtime code specialisation
  4. Boxing everything
  5. Static code specialisation

From OCaml's perspective, I think that (2) and (4) don't seem acceptable as a long-term solution in terms of performance. Here, compiling to the WASM linear memory and shipping our own GC seems a more attractive choice.

So, that leaves (3) and (5).

(3) seems fairly complex. If the WebAssembly engine would do the runtime code specialization, or if we could reuse some infrastructure from another, similar language, it could be worthwhile for us to work with that. It currently seems unlikely that OCaml in general will switch to (3) in the foreseeable future, unless we can come up with a simple model of runtime code specialization. I expect that implementing runtime code specialization in a WebAssembly engine goes way beyond a MVP, so it seems unlikely this will happen.

(5) is simpler than (3) in the sense that we do not have to ship a nontrivial runtime. If we analyze the whole program in order to emit precise types (struct instead of anyref array) for our heap blocks on WebAssembly, we wouldn't need to use i31ref and we can reap the other benefits of whole-program optimization (e.g. dead-code elimination, operating with native unboxed values, no awkward 31-bit arithmetic). Still, this will be a sizeable amount of work (possible too much to do it right away).
I also can't say how bad the size of the emitted code will be in terms of all the types we need to emit. Instead of emitting a single Value type, we need to emit one struct type for every "shape" of heap block that can occur in the program. To keep this manageable, we need to unify all the types whose heap block representations have the same shape.
Then, static code specialization kills one nice feature of OCaml: separate compilation of modules. However, instead of doing static code specialization before emitting WebAssembly, maybe it is possible to implement a linker for our emitted WebAssembly modules that does code specialization at link time if we emit some additional information to the compiled WebAssembly modules? This kind of linker could possibly be interesting to other languages that are in a similar position as us, as well. Obviously, link time will be slower than we are used to. I haven't thought this through in detail at all.
It seems likely that these issues are manageable, if enough effort is put into them.

Edit: while the previous paragraph sounds fairly optimistic, looking into whole-program monomorphization (turning one polymorphic function into several non-polymorphic ones) more closely, it is definitely not trivial to implement. Types that we would need for this are no longer present at the lower compilation stages. When I look at the MLton compiler (a whole-program-optimizing compiler for Standard ML), it seems that it is a good idea to monomorphize early, in order to be able to optimize based on the types of parameters. Features like GADTs, and the ability to store heterogenous values or polymorphic functions in hash maps (or other data types) do not make it simpler. It looks to me like this would mean almost a full rewrite of the existing compiler and it is not obvious whether every function can be monomorphized (without resorting to runtime type dispatch).

Are we missing something here, are there other techniques that we have been overlooking so far? Feel free to drop pointers to good papers on the topics in general, if you know some.

Also, I am very interested what perspective other languages with similar value representations have on this and whether there is interest in collaborating on a code-specializing and dead-code eliminating linker.

@RossTate
Copy link
Contributor

Although you've ruled it out, I would actually suggest (4) for OCaml, albeit with some optimization.

First, the optimization I have in mind is to have OCaml function definitions operating on int compile to wasm function definitions operating on i32. For polymorphic uses of these functions, have another definition that works on whatever "uniform" representation of OCaml values you use, simply calling the optimized function with appropriate (un)boxing. Or more generally, when a value can be statically determined to be an int, represent it as i32.

There are a few reasons I suggest this:

  1. A number of other languages have found this approach effective enough. See various JVM languages like Java, Kotlin, and Scala that all take this approach.
  2. It's still possible the MVP will have i31ref. It's also possible it'll instead have something like i32ref, which would generally be boxed on 32-bit machines and unboxed on 64-bit machines.
  3. Even if the MVP doesn't have these things, the MVP is just the first release. WebAssembly will advance, possibly in ways that would make specifically 31-bit integers often (if not always) be unboxed for modules that specifically request it.

So in short, I suspect the short-term costs of this approach are more acceptable than you might expect (with some local/composable optimization techniques), and I suspect it is more likely to adapt well to integrate the improvements that will be made to WebAssembly over time.

There's also a meta-point to consider: having WebAssembly modules compiled from OCaml in this manner will give the CG realistic programs to experiment with and analyze, which can be used to determine if it's worth the effort (at all levels of the process) to give modules more control over their low-level representations. Right now we can only hypothesize.

That all said, you know your specific needs much better than I, and even if this approach might work well for the OCaml community broadly, it still might not be well-suited to more specialized concerns you might have.

@sabine
Copy link
Author

sabine commented Jun 28, 2020

Fair enough, (4) is close enough to (1) to look like the next best solution in terms of effort needed and we could use i32 arithmetic. I'll look into this in more detail to see if it is indeed so straightforward.

@sabine
Copy link
Author

sabine commented Jun 28, 2020

I'm wondering if I can do this by first compiling to "WASM GC MVP with i31ref" and then translating that code to "WASM GC MVP without i31ref". At least, this way, I will not have to go all the way back to unbox all these integers again when a feature that is equivalent to 31ref arrives. Has anyone done that?

@RossTate
Copy link
Contributor

Even with i31ref, you have to coerce integers to and from i31ref using instructions i31.new and i31.get_u/i31/get_s. So, if the design of i31ref changes, it should be pretty trivial to change your compiler to substitute in the appropriate coercion instructions.

The more difficult thing to change would be using 31-bit vs. 32-bit arithmetic. It sounds like putting in the effort for 31-bit arithmetic is only worth it if i31ref exists. Since that's not guaranteed, I'd use easier 32-bit arithmetic for now, and then put in the effort to switch to 31-bit arithmetic only if/when i31ref has been finalized.

@rossberg
Copy link
Member

@sabine:

Then, static code specialization kills one nice feature of OCaml: separate compilation of modules.

That would be the smallest problem. More importantly, as you note in your edit, static specialisation kills any form of first-class polymorphism, of which OCaml has plenty (polymorphic methods and record fields, first-class modules, GADTs, ...). So I think this approach is simply impossible to use. The same is true for any other language with a rich enough type system.

@RossTate:

For polymorphic uses of these functions, have another definition that works on whatever "uniform" representation of OCaml values you use, simply calling the optimized function with appropriate (un)boxing.

That is not a scalable approach. In languages like OCaml, you have polymorphic type inference, and by design that results in way more polymorphism than in traditional languages. You also use lots of abstract types. In particular, many functions are polymorphic in multiple independent types, and refer to multiple abstract types. To be efficient, the approach you suggest would require an exponential number of specialisations to be generated upfront for each such function.

There also is the practical problem of porting. If the GC proposal does not support established implementation techniques, but instead requires a major rewrite of a pre-existing compiler and runtime, then it will not be a plausible target for many existing languages.

@RossTate
Copy link
Contributor

The advice I gave here is specific to OCaml. There is only one type we were talking about doing this for: int. You can do these coercions per use site, and you can consolidate coercions if you want. These are conceptually coercions between monomorphic and polymorphic calling conventions. There will need to be coercions between curried and uncurried calling conventions anyways. That is, an int -> int -> int can and should be called with two arguments (i.e. uncurried), but it can also be given to of type (int -> 'a) -> int -> 'a, in which it must be called with one argument (i.e. curried). That polymorphic function will then return a function using a curried calling convention and so that result often needs to be coerced to an uncurried calling convention. That is, unless you are suggesting that all functions always use the curried calling convention, since WebAssembly does not currently support the sort of variadic arguments that would be necessary to eliminate these coercions. So my suggestion just hooks into infrastructure that would need to be in an OCaml-to-WebAssembly compiler anyways.

@gasche
Copy link

gasche commented Jun 29, 2020

There has been a good amount of research work on type-directed or shape-directed unboxing, supported by code specialization. For a recent example, see the work on call-graph-based specialization of (boxed) polymorphic functions as part of the 2017 thesis of Dmitry Petrashko on Scala. But these work typically require a fair amount of sophistication. In contrast, having an efficient way to "box" immediate scalars without going through the heap is easy for backends to allow, and fairly easy to compile into for types that fit the smaller width. The two approaches are complementary (clever specialization optimizations has other benefits), and yes, boxing everything is of course possible if the low-level substrate does not provide better, but getting reasonable performances requires much more sophistication on the user side.

Many of the successful language implementations work by keeping a tight check on their complexity budget: they go for the 20% of sophistication that brings 80% of the benefits in as many areas as possible. OCaml was faster than most ML, or Chicken Scheme than most Schemes, not through extremely optimizations, but rather thanks a few very-important optimizations (static function calls) and good data-representation choices that ensured that typical programs were fast by default.

It is also possible to build fast systems that are slow by default but fast through excellent, sophisticated engineering (good JVMs or CLR implementations, GHC, Graal+Truffle, etc.), but I agree with @rossberg that it is important to ensure that the 20%-80% approaches are available first.

@RossTate
Copy link
Contributor

The issue is "alternatives to i31ref", and the concerns are fairly specific to OCaml due its use of specifically 31-bit arithmetic. (Haskell, for example, only guarantees 30-bit precision because they have found reserving 2 bits for GC to be useful.) There is already another very long thread, #53, about i31ref. Let's not pull this thread into an argument about i31ref, an argument that cannot be conducted well if we have not even considered the alternatives, which is what this thread is about.

@rossberg
Copy link
Member

Consider a function like this one from the Map interface:

val mapi : (key -> 'a -> 'b) -> 'a t -> 'b t

This is polymorphic in two type variables and two abstract types. Each of them could either be an integer or something else. So even with just int, that makes 2^4 = 16 different ways in which you'd have to specialise the code according to your suggestion (8 if the compilation scheme allows to fix the export type t).

Or you reduce the number of specialisations, lump several dimensions together, and pay the extra cost of frequent boxing/unboxing conversions for the rest. Many variations of this have been investigated in highly polymorphic languages, and there are reasons few are used in practice. Leroy (among others) actually had a line of papers researching unboxing for polymorphic languages, yet OCaml, like many similar languages, uses a uniform representation instead. He observed that the conversion overhead often is higher than the benefit of unboxing, and that a minimal unboxing scheme produces better results for typical profiles.

I'm not intimately familiar with how OCaml's native code compiler handles currying, but the byte code compiler handles it through clever dynamic stack techniques that are unrelated to compile-time unboxing.

@rossberg
Copy link
Member

the concerns are fairly specific to OCaml due its use of specifically 31-bit arithmetic.

No, they're not, and it's getting a bit boring at this point that I have to keep refuting that. So again: plain ints are just one particular use case. The general purpose is unboxing small scalars of any sort, for which there are many, many use cases, in many, many languages. (And of course, 30 bit ints would fit just fine as well.)

@jayphelps
Copy link
Contributor

As a bystander who follows these threads I wanted to share some feedback that this aggressiveness is quite off-putting.

@RossTate
Copy link
Contributor

but the byte code compiler handles it through clever dynamic stack techniques that are unrelated to compile-time unboxing

Unfortunately, WebAssembly does not support these techniques. So I believe coercions will be necessary instead. @sabine, have you considered this problem with currying?

To clarify, I was not suggesting specializing polymorphic functions based on their type arguments. I was suggesting coercing monomorphic functions that work specifically on integers to polymorphic functions operating on the uniform representation (when used polymorphically). This could be done with an augmentation of the coercion system I was suggesting above for dealing with currying.

@rossberg
Copy link
Member

To clarify, I was not suggesting specializing polymorphic functions based on their type arguments. I was suggesting coercing monomorphic functions that work specifically on integers to polymorphic functions

Oh, okay, but that's probably even worse. You would have to box/unbox in all first-order contexts where you pass a value from polymorphic to monomorphic functions and vice versa, which is like all the time. And you'd still have the same problem: for a function of type int->int->int, there are 8 different possibilities of a polymorphic context to which it could be passed in a higher-order fashion. All would have a different calling convention under the approach you suggest. So either your function has 8 different specialisations, or you're creating wrapper functions at each polymorphic use site.

@aardappel
Copy link

@sabine thanks for summarizing things so clearly.

First, until we invent JIT and other features, the current Wasm, even when including the current GC proposal, is an entirely static, dare I say "closed world" system, with AOT compilation to a single .wasm, with concrete types at the boundary. By definition, all forms of polymorphism, modularity, separate compilation etc will be resolved by the time that .wasm gets baked, no matter what the language likes to present beforehand. This alone should make (5) by far the most realistic option for most languages that care about speed, again, in the current way Wasm works.

There seems to be a lot of default fear of (5), that it going to result in unreasonable code bloat. If you generate specializations under a closed world assumption, you only generate the ones that are actually used. This causes more in-lining of single-use functions (not further increasing code size) followed by much more aggressive optimisation of inlined code working directly on specific naked (scalar) types, often shrinking the code back down significantly. It turns indirect calls into static calls, helping greatly with dead-code elimination. LLVM and binaryen can do endlessly more on code like this than code that stays generic. In my personal experience working with (5) a lot, the amount of code bloat can be remarkably small, and the resulting speedups due to code "collapsing" impressive.

But @rossberg is correct that anyone that prefers (5) can make this happen, regardless of the presence of i31ref. So why would we want i31ref additionally?

I'd like to see an example of polymorphism that is impossible to compile down to specialized code using (5), and requires a struct of all anyref to work. I think, by definition, that such an example does not exist (given that we don't have any dynamic code features in Wasm that would make this statically impossible).

There will be people who have code size as their #1 concern, and don't believe my story above that it can actually help reduce code size, when done right. That's fine, because it is hard to make claims about this in the absence of specific languages with specific compilers and specific applications.

Finally, are there downsides to having i31ref when you don't use it? Under what circumstances does a Wasm implementation have to check "is this thing an i31ref?" before accessing a pointer? If anyref is always a pointer, always pointing to a heap object where the initial bytes always have the same layout (e.g. a type id), I can imagine that would lead the speedups. If indeed there are frequent checks required for i31ref, that to me would be the strongest reason to not wanting to have it.

@rossberg
Copy link
Member

@aardappel, repeating what I pointed out above, and I believe many times before: static type specialisation does not work in a language with first-class polymorphism or polymorphic recursion.

Simplest possible example in OCaml:

type poly = {f : 'a. 'a -> unit}
let g {f} = f 1; f "boo"

There is no way you can statically specialise anything here, since you don't know what polymorphic definition you are calling. It's completely dynamic. A client of g could pass anything for f, it could be taken from a list, or any dynamically computed data structure. Inversely, the site defining an f generally has no way of knowing where it flows and what types it will be used at.

The same is true for generic methods in a language like C#, which is why the CIL performs jit-time type specialisation. And it is the reason why C++ does not allow "template virtual methods", because its simplistic compilation model cannot support them.

Static specialisation also does not work in a language with polymorphic recursion, because the set of instantiations can be statically unbounded.

If I could make a wish, then that folks in these discussions could accept the fact that there are very good reasons why certain classes of languages are implemented in certain ways, that there has been decades of engineering and research into it, that we are not smarter than all the many folks who did this, and that we need to support a mapping for those compilation techniques instead of hoping for some yet undiscovered trick to make these requirements go away. Please?

@gasche
Copy link

gasche commented Jun 29, 2020

First, until we invent JIT and other features, the current Wasm, even when including the current GC proposal, is an entirely static, dare I say "closed world" system, with AOT compilation to a single .wasm, with concrete types at the boundary. By definition, all forms of polymorphism, modularity, separate compilation etc will be resolved by the time that .wasm gets baked, no matter what the language likes to present beforehand

I'm confused, doesn't wasm allow producing modules and linking them together? I would expect compilers to use those to produce modules at the natural separate-compilation granularity of their source language. In particular, I would expect to compile each module/crate/package separately, without making assumptions on how other modules are going to use it. Of course you can still generate WASM code only after a pass of link-time-optimization, but I would naively expect that it can lead to code-distribution issues (if the generated WASM for my library depends on the clients it's compiled with, code caching etc. is going to be more delicate to handle).

I'd like to see an example of polymorphism that is impossible to compile down to specialized code using (5), and requires a struct of all anyref to work.

It's easy to have examples where specialization leads to impressive blowups in code size. I mentioned the Scala specialization work by Dmitry Petrashko earlier; it starts by evaluating previous 2014/2015 work on call-graph-based type specialization in Scala on the codebase of the Dotty compiler, and found out that on average each method would be specialized 22 times.

Then there is the example of polymorphic recursion, where the set of different type instantiations depends on runtime data (it particular it may be arbitrarily large). (In most cases I'm familiar with, the number of different "shapes" may be bounded, especially if you only distinguish immediates vs. pointers.)

But then what about union types? Many language runtimes use representations that are either an immediate word or a pointer, with tagging to distinguish the two. This is ubiquitous for example in implementations of ML, Haskell, Scheme and what not. You can very easily have tuples or arrays of such immediate-or-pointers values, where the immediate-or-pointerness of any element is only known at runtime and can change on mutation. How would one operate on this data if functions are expected to know statically whether they take an immediate or a pointer value?

Under what circumstances does a Wasm implementation have to check "is this thing an i31ref?" before accessing a pointer? If anyref is always a pointer, always pointing to a heap object where the initial bytes always have the same layout (e.g. a type id), I can imagine that would lead the speedups.

I'm not very familiar with Wasm (I was drawn into this discussion by chance), but I believe that:

  • programs using int31ref would need to check for it when they know it can happen; this does not affect the performance of programs not using this type
  • the Wasm GC would need to check int31ref before following a reference (because in this case it's not a valid pointer); checking one bit of a word is practically free compared to dereferencing the word as a pointer
  • someone willing to propose runtime operations operating on any wasm value (serialization, etc.) would do similarly need to check for int31ref on anyref values; again this is extremely cheap

If I understand correctly, Wasm implementations try to be safe from undefined behavior, so in particular they are careful to check that pointers go into allowed memory before dereferencing them. This monitoring discipline, and in general wasm sandboxing/CFI features, are going to completely drown out the cost of checking bitpatterns on a word (even if you decided to do checks or masking on each dereference).

@aardappel
Copy link

@rossberg your example is essentially a function pointer to a generic function if I'm not mistaken, which is indeed not possible in languages that rely entirely on monomorphic specialization. I'd expect a compiler to choose to force boxing on args for such functions (causing an extra heap alloc for f 1).

I don't see why generic functions that are not used in this very dynamic fashion (which I would certainly hope are the vast majority) should suffer from the mere presence of this possibly very dynamic use, and certainly not why Wasm as an eco system should pay for this cost (if the mere presence of int31ref introduces conditionals). I'd hope the cost to only be born by functions that use this functionality.

I am well aware that there are endless people who spend longer time looking at this than me, but just an "appeal to authority" is not going to be sufficient to stop me from at least questioning this direction.

@RossTate
Copy link
Contributor

If I could make a wish, then that folks in these discussions could accept the fact that there are very good reasons why certain classes of languages are implemented in certain ways, that there has been decades of engineering and research into it, that we are not smarter than all the many folks who did this, and that we need to support a mapping for those compilation techniques instead of hoping for some yet undiscovered trick to make these requirements go away. Please?

The Glasgow Haskell Compiler, which presumably is considered a production compiler within this class of languages, does not unbox Int even though the Haskell semantics specifically permits Int to have only 30 bits (a number they arrived at because other runtimes that do unbox Int found it useful to reserve two bits for GC). From what I understand, they did so because they found that algebraic operations were so common that it was more useful to embed the algebraic-constructor tag in the low bits of the pointer. So the experience of Haskell developers is that either custom pointer tagging should be preferred over i31ref for better performance or i30ref should be preferred over i31ref for better GC strategies. Having looked through a number of language implementations, this sort of gap seems pretty common. Unfortunately for OCaml, it is an outlier because I believe it is the only language anchored specifically around 31-bit integers. This is why I thought it would be useful to focus specifically on advice for OCaml.

@aardappel
Copy link

@gasche

I'm confused, doesn't wasm allow producing modules and linking them together? I would expect compilers to use those to produce modules at the natural separate-compilation granularity of their source language.

Yes, but the types at the edges are currently very basic types that may well not make it possible to express the full set of type feature of a language like ML, meaning separate compilation would only be "safe" if compiled from one version of the program, essentially not making it separate compilation anymore. This may change in the future but is TBD. Also, current linking of multiple Wasm modules is runtime-only, though static linking is planned.

and found out that on average each method would be specialized 22 times.

Average? Each method in the entire compiler? Meaning many methods would be specialized hundreds of times? (to account for the methods only used once). I find that hard to understand how that is possible. Does this account for optimization and DCE of the now much more static blown up code?

Would be more useful to know how much a codebase would blow up in total, for example the ratio of AST nodes of a large program before and after specialization. In my tests for example, in cases with heavy nesting of higher order functions, that ratio was at most 2x (after optimization). All very anecdotal of course.

If I understand correctly, Wasm implementations try to be safe from undefined behavior, so in particular they are careful to check that pointers go into allowed memory before dereferencing them. This monitoring discipline, and in general wasm sandboxing/CFI features, are going to completely drown out the cost of checking bitpatterns on a word (even if you decided to do checks or masking on each dereference).

An anyref is entirely under the control of the Wasm implementation, thus typically would need no checking of any kind before dereferencing as a pointer. Checking its bits, and potentially branching (and maybe missing) would be an entirely additional cost int31ref would bring. So no, not quite for free.

(sandboxing is only done for linear memory pointers, but even there it has no cost typically due to use of memory mapping features)

@rossberg
Copy link
Member

@aardappel:

I don't see why generic functions that are not used in this very dynamic fashion (which I would certainly hope are the vast majority) should suffer from the mere presence of this possibly very dynamic use

You don't know at its definition site which polymorphic function ends up being used that way, and with what degree of polymorphism. The composition can happen somewhere else entirely, e.g. in a different module. And it can be partially instantiated, i.e., you plug in a polymorphic function somewhere where a less polymorphic (but not monomorphic) type is expected. There are many degrees of freedom, and you generally need a compilation scheme that is both efficient and compositional for all of them.

In the limit, the possibilities for polymorphic composition are almost similar to those in a dynamic language, which is why they make similar representation choices. (Though an important difference is that there usually is no observable type dispatch, because polymorphism is fully parametric.)

@RossTate:

As I have stated many times before, whether it's i31 or i30 is largely immaterial. This is an internal representation type. For the vast majority of use cases its max size doesn't matter. However, we can't easily make extra pointer bits available to user space across engines anyway, so AFAICS, we can just as well provide 31 bit range for tagged ints. But if there are reasons to pick another width, then that's totally fine as well. I just haven't heard them yet.

There are many languages that use (wordsize-1) bit integers in their implementations one way or the other. Whether a language implementation exposes that to users is a completely separate question. Some do (not just OCaml), some also expose values like 30, 29, or 28. But where they do, the language definition typically does not guarantee an actual size. For example, OCaml explicitly states that int_size can have other values. So again, size doesn't matter.

@aardappel:

Yes, but the types at the edges are currently very basic types that may well not make it possible to express the full set of type feature of a language like ML, meaning separate compilation would only be "safe" if compiled from one version of the program, essentially not making it separate compilation anymore.

Huh? I'd fully expect that a language with a proper module system can (and should!) compile its modules to Wasm modules separately, and without any loss of generality. Wasm's module system allows you to import/export anything, so such a compilation scheme should be perfectly possible.

An anyref is entirely under the control of the Wasm implementation, thus typically would need no checking of any kind before dereferencing as a pointer. Checking its bits, and potentially branching (and maybe missing) would be an entirely additional cost int31ref would bring. So no, not quite for free.

You can't deref an anyref. You first need to cast it to something concrete. That necessarily involves a check, with or without i31ref.

@jakobkummerow
Copy link
Contributor

Checking its bits, and potentially branching (and maybe missing) would be an entirely additional cost int31ref would bring. So no, not quite for free.

You can't deref an anyref. You first need to cast it to something concrete. That necessarily involves a check, with or without i31ref.

That is indeed the one place where adding i31ref has a non-zero cost: when ref.test or ref.cast or br_on_cast operate on a value that's statically typed as anyref or eqref, then before loading and comparing that value's type descriptor, they first have to check the pointer's tag bit. Given that the type descriptor check (which needs to happen with or without i31ref's existence) involves a load from memory, and the tag bit check doesn't, I'm confident that the additional overhead is negligible.
Any code that has more specific static knowledge than eqref is entirely unaffected by the existence of i31ref. Code that passes around anyref values (without downcasting them) is just as unaffected.

@sabine
Copy link
Author

sabine commented Jun 30, 2020

As a bystander who follows these threads I wanted to share some feedback that this aggressiveness is quite off-putting.

I don't think anyone contributing here wants to come off as aggressive (I certainly don't want to 😄). This i31ref part of the spec is a rather controversial issue in both a technical and political sense:
(a) WASM is this great thing that everyone wants to be part of and that, by aiming for strong safety guarantees, sets the bar quite a bit higher for everyone, compared to the commonly-used traditional architectures that we are used to.
(b) so, all kinds of different people are coming together. Many people (language implementers, engine implementers) are coming here with existing codebases. When looking at compiling to WASM or implementing WASM, we discover choices in our codebases that make it difficult to work with WASM. Some things that were before considered acceptable in the existing codebase now present as technical debt. Some choices are very fundamental to the codebase and it is not immediately obvious whether or even how we can undo and rework the codebase at a reasonable cost. So we all try to understand how to work with the spec in a way that is maintainable in the long run.
(c) all involved here care deeply about performance. With WASM being a melting pot for all kinds of languages, none of the language implementers can expect to get a WASM that caters optimally to all concerns of their language (or, for that matter, a WASM that is trivial to compile to). There are so incredibly many constraints and requirements coming from all the languages overall. Still, we consider reasonable tradeoffs in performance worth it in the overall scheme, the premise of being able to run code anywhere by compiling to WASM. One important task of WASM engine implementers is to watch over the specification and look for features that introduce unnecessary performance costs, while all the compiler implementers try to lobby for and ask for features that they can make use of (ideally, while keeping in mind the overall goal of making WASM an efficient compilation target for many languages). The goal here for WASM GC is to strike a balance between WASM GC being small and fast and providing the expressivity that makes it a reasonable compilation target.
(d) resource constraints matter for all of us. Decisions need to be made and we all need to put only so much on our plates that we can still handle that and get something done. I appreciate very much that WASM is not some huge gigantic beast, but a rather compact language whose specification can be read and mostly understood in a very short time.

I'm trying to summarize and sometimes comment on some of the points brought up. If you feel that I missed your point or something needs clarification, do speak up.
@rossberg says that first-class polymorphism, of which OCaml has plenty (polymorphic methods and record fields, first-class modules, GADTs, ...) makes it impossible to do static code specialization to the point WASM GC without i31ref requires. OCaml compiler implementers that I asked seem to share this idea and provided some hard-to-monomorphize examples. I know too little about all the features of OCaml at this point (I have been working with the compiler only since September part time and I still have a lot to learn.) to say whether full-program monomorphization is possible or not - I do not have a proof for either of these hypotheses. @rossberg do you know good papers or theses on this topic?
@gasche says that type-directed or shape-directed unboxing, supported by code specialization is a fairly sophisticated technique for compilers to implement, while, in contrast, having an efficient way to "box" immediate scalars without going through the heap is easy for backends to allow, and fairly easy to compile into for types that fit the smaller width. I agree with your assessment, that OCaml, in particular, chose to go with not many optimizations by choosing a memory model that brings 80% of the performance to be expected to the table. The implicit assumption that the memory works this way, that you can put an unboxed integer into a memory cell in a heap block is being relied on in large parts of the compiler (in particular those parts where enough optimization has been done in order to make them a worthwhile starting point for compiling to WASM).
@aardappel says There seems to be a lot of default fear of (5), that it going to result in unreasonable code bloat. Looking into MLton, I believe now that code size with monomorphization is likely not the major problem, as the bloat is also offset by the ability to do dead-code analysis in a whole-program compiler.
@gasche says Many language runtimes use representations that are either an immediate word or a pointer, with tagging to distinguish the two. Indeed, we have the same representation in OCaml. type t = X | Y of int is a type that has two variants, X and Y. The variant X, which has no additional data attached, is represented as an integer, while Y is represented by a heap block with a field (tag) that says it's Y and another field that holds the attached integer value. This representation is introduced so early in the compiler that we do not have the information to undo it at a later in the compilation pipeline where we could branch off to WASM with reasonable implementation cost. I could call this either "technical debt" or a natural consequence of the choice of memory model permeating the whole compiler. There are people from the OCaml compiler team who believe that it should be possible to rewrite the compiler to introduce a different memory representation for variant types, but it will require a lot of effort.
@rossberg says that WASM should support a mapping for fundamental compilation techniques of polymorphic languages. As someone who wants to bring OCaml to WASM, I'm obviously very much in favor of that. In contrast, @aardappel sees WASM GC MVP as a target aimed at languages that can compile via static code specialization. Which, by looking at all the parts of the spec apart from i31ref seems mostly true.

Concerning the topic of separate compilation of modules: So far, we are not aware of any problems with separate compilation of modules for OCaml, in the presence of i31ref. Obviously, in a MVP compiler backend, to a MVP language running on MVP engines, only modules compiled with the same compiler version will be compatible. That's perfectly fine. No complaints whatsoever here. 😄 Users will understand.

To get back to the topic of alternatives to i31ref: When I think of an alternative to i31ref, I think of something that enables by some means a memory model that allows us to, with reasonable efficiency,
(1) store heap pointers and integers in the same array and read them back out
(2) in a way that we can, at runtime, check whether a value is an integer or a heap pointer
so that we can write a compiler backend for the OCaml compiler that compiles to the WASM GC MVP with the resources we have at hand.

Traditional hardware architectures have a memory model that lets us fulfill these requirements by pointer tagging integers at the cost of losing one bit on the integer representation - but there may be other reasonable alternatives that I don't see right now, that work for us, and that other languages could make better use of than i31ref. Possibly the alternative can live outside of WASM GC MVP, but I am not sure about that at all. If there was some external tool with a memory model that fulfills (1) and (2) and that compiles to WASM GC MVP, with a long-term expectation of reasonable performance, we could work with that. I don't know how that could work, though.

I believe that these two requirements are all that we need in order to compile all the crazy polymorphism in OCaml. I think @rossberg is right that, OCaml, in the long term, can work with 30-bit integers or 31-bit integers, or even, crazy as it may seem, 29-bit integers.

Requirement (2) can possibly be lifted after a major refactor of the OCaml compiler (which is not 100% guaranteed to succeed, and that we don't have resources for right now, but at least there are people optimistic that it could work).

Having both requirements (1) and (2), it looks to me like I need to double-box integers in a MVP without i31ref. I'll elaborate on that later.

I'm very interested in other languages who would use i31ref to compile polymorphism (no matter whether the language is dynamically or statically typed). Even more so, I am interested in the perspective of languages who need to compile extensive polymorphism but do not see i31ref as useful, needing something else. Please comment and describe your constraints for the memory model.

However, I expect that as soon as i31ref makes it to the implementation stage, people will start asking for i63ref.

@RossTate
Copy link
Contributor

Thanks for the great post, @sabine!

I think @rossberg is right that, OCaml, in the long term, can work with 30-bit integers or 31-bit integers, or even, crazy as it may seem, 29-bit integers.

I was just about to ask you this question, so thanks for beating me to it! This is extremely useful, as it's the kind of perspective we can only get from language implementers.

Can I ask you for some more of your perspective along this line? Something I've been wondering is that, if we do change the number of bits for guaranteed-unboxed integers, what's a non-arbitrary number to change it too? The number that comes to mind is 24 bits because that's 3 bytes (so just enough for RGB and a little more than enough for some Unicode encodings). So I'm interested to hear specifically your thoughts on whether that would work for OCaml? That is, do you know of any OCaml applications or implementation strategies that would suggest a useful lower-bound on the number of bits needed for unboxed integers?

@fgmccabe
Copy link

JS implementations faced a similar dilemma with the need to optimize simple integers. The dominant technique there seems to be nan-boxing; which in the language of i31ref would be equivalent to i52ref.
TLDR: the discussion becomes moot when we migrate (which we no doubt will) to 64bit wasm.

@aardappel
Copy link

@rossberg

You don't know at its definition site which polymorphic function ends up being used that way, and with what degree of polymorphism. The composition can happen somewhere else entirely, e.g. in a different module.

It is my assumption that if you compile down to a single .wasm, there comes a moment where with static specialization you can know the vast majority of these cases. If your default way of working involves completely independent compilation to separate .wasm files, then yes, you more often cannot, but then again you're kind of bringing this upon yourself.

You can't deref an anyref. You first need to cast it to something concrete. That necessarily involves a check, with or without i31ref.

I was thinking of engine code, like the GC traversing objects. And checks can be more or less expensive.

@jakobkummerow

That is indeed the one place where adding i31ref has a non-zero cost: when ref.test or ref.cast or br_on_cast operate on a value that's statically typed as anyref or eqref, then before loading and comparing that value's type descriptor, they first have to check the pointer's tag bit. Given that the type descriptor check (which needs to happen with or without i31ref's existence) involves a load from memory, and the tag bit check doesn't, I'm confident that the additional overhead is negligible.

That bit check may involve a missed branch, but yes, for code that doesn't use i31ref (which is what I'm concerned about) you'd never get a branch miss. Conversely, one could regard loading the type field to have no cost if the subsequent code is going to touch all the following fields anyway.

I'll take your word for it for now that cost will be negligible, though I hope at some point we'll have some numbers, particularly a real world GC benchmark (not containing any i31ref values) with bit checks turned on and off. I'm a big fan of the "you pay for what you use" principle, in Wasm, and all of computing.

@aardappel
Copy link

aardappel commented Jun 30, 2020

@fgmccabe

TLDR: the discussion becomes moot when we migrate (which we no doubt will) to 64bit wasm.

It's not just about how many bits we can fit in a pointer, it is also about whether we want to force the need to check these bits upon languages that don't need it (or languages that may need it, for code where it can be statically known that it's not needed).

64-bit wasm is mostly about 64-bit operands to linear memory load/stores, and thus larger linear memories, not much else. The size of an anyref would still be entirely an implementation choice, and in theory, an implementation could be using 32-bits to represent anyref even while linear memory is running in 64-bit addressing mode.

Conversely, does the GC proposal put some expected upper bound on the amount of GC objects that can be addressed? If this can be >2GB, then likely it must use 64-bit pointers internally, and we might as well use i63ref ? ;) That would not be great for any remaining 32-bit CPU architectures we want to run on, but it is certainly is strange (and wasteful?) that in practice, almost all i31refs will sit inside a 64-bit pointer.

Actually, even an i63ref on a 32-bit architecture is not that expensive, since if the bit says its a pointer, it can simply truncate without a further check. It doesn't even need to load the 2nd half from memory. So maybe we should go for i63ref, if we believe 32-bit platforms are becoming less relevant for Wasm as time goes on.

@gasche
Copy link

gasche commented Jun 30, 2020

@aardappel

Yes, but the types at the edges are currently very basic types that may well not make it possible to express the full set of type feature of a language like ML, meaning separate compilation would only be "safe" if compiled from one version of the program, essentially not making it separate compilation anymore.

One could make the same argument about separate compilation when linking is done at the assembly level, or LLVM level, or JVM bytecode level, yet those are things that implementations do in practice, because whole-program compilation is extremely inconvenient in various ways. When you generate code for a library into a module, you don't know what client modules you will be linked against, so it is hard to tell what specialized instances you need to generate. It is possible of course to design whole-program three-shakers to reduce code size by propagating whole-programmation about usage, and a common practice in some communities, but that does not remove the importance in practice of also having an open-world compilation model.

@RossTate

From what I understand, [GHC does pointer tagging] because they found that algebraic operations were so common that it was more useful to embed the algebraic-constructor tag in the low bits of the pointer.

Haskell is in an exceptional situation due to the pervasiveness of laziness: many values are thunks, and GHC's evaluation model performs an indirect jump on inspection, even to find out that a value is already evaluated. The benefit of pointer tagging is not to avoid a dereference (which you need to do in most cases to get the constructor arguments anyway), but to avoid this indirect jump. The way they avoid the cost of excessive boxing is through very aggressive inlining and optimizations in general; again the "sufficiently smart compiler" approach, which I don't think should be held as a goal for language implementors.

Note that there is a lot more Haskell code written in the strict subset these days, so it is not completely clear to me that this pointer-tagging choice is still the right design. At the time it was introduced, pointer tagging provided a 5% performance improvement over a strategy without pointer tagging. At the risk of going completely into guesswork: this suggests that it is reasonable to consider implementing a Haskell runtime without pointer tagging, and get realistic performance. (You may even see interesting gains by using tagged integers, which may offset the absence of tagging.) On the other hand, when implementing a strict language, pointer tagging does not help, and not having tagged immediate values makes you fall off a performance cliff.

So the experience of Haskell developers is that either custom pointer tagging should be preferred over i31ref for better performance

I am not aware of a performance comparison between pointer tagging and immediate-integers tagging for Haskell programs.

or i30ref should be preferred over i31ref for better GC strategies.

Chicken Scheme uses more than one bit of tag for value shapes, but only immediate integers use the least bit set to 1. In this model you can have i31ref, and still there are extra tag bits available in the 0 case. (Not sure whether this would work for the Haskell runtimes you are mentioning.)

I believe that in practice, if you used much less than 31 bits for immediate values, people would not use this for an integer type (24 is too small for your main type of machine-length integers), only for other immediate values. This might be a workable compromise (in particular, maybe people want 60-plus integers nowadays anyway), but it is not completely clear what the benefits are.

@sabine

I know too little about all the features of OCaml at this point (I have been working with the compiler only since September part time and I still have a lot to learn.) to say whether full-program monomorphization is possible or not - I do not have a proof for either of these hypotheses.

Ahead-of-time full-program monomorphization is known to be impossible for languages that support polymorphic recursion, including OCaml and Haskell (but not SML, see MLton), Agda, Idris, etc. If you have a JIT you can monomorphize "on the fly", this is what the dotnet runtime does. This point was made in this earlier comment of @rossberg: #53 (comment)

@gasche
Copy link

gasche commented Jun 30, 2020

@RossTate here is another argument that you may find interesting, in terms of finding a "principled" argument for choosing one size or the other.

We are talking about splits in the data space of values that can fit one word. You may want to have tag bits/patterns for either pointers (GHC pointer tagging, or other tricks used by other implementations) or scalars (non-pointers). For example, the Chez Scheme runtime has a bitpattern for fixed-sized numbers but also for booleans, pairs (the "cons tag" mentioned in the discussion), symbols, vectors, etc. The GC only needs to know the fine-grained structure of the pointers, so from the GC perspective we may need many categories of pointers, but only one category of values (which language implementations can the subdivide with more tagging). Given that both sides may need tag space depending on the language, it makes sense to divide the space evenly between pointers and non-pointers: this is exactly the int31ref design.

@sabine
Copy link
Author

sabine commented Jun 30, 2020

@RossTate

if we do change the number of bits for guaranteed-unboxed integers, what's a non-arbitrary number to change it too?

I don't think that a non-arbitrary number of bits per se exists. In an ideal world, we would have garbage collection in hardware, and we wouldn't need to chop off bits from a hardware word in order to implement efficient garbage collection in software. However, a specification for garbage collection in hardware mostly faces the same challenges as the WASM GC spec: there being a lot of different ideas of how efficient GC should work like, with nearly every language bringing both some acquired tastes and some fundamental invariants to the table.

I agree with @gasche that, if the number of bits gets too low, we will not use this to implement integer types: the better trade-off in that situation is to implement the integer types via boxed values instead of these unboxed integers, work on optimizing that representation, and enjoying native 32-bit arithmetic.

Note, though, that the implementation of integer types is only one of the situations where guaranteed-unboxed integers are used in the data representation of OCaml. My impression is that the OCaml compiler can make good use of guaranteed-unboxed integers with less bits in most, if not all of these other situations. (I need to check back with people tomorrow to make a list of all the situations where boxing the unboxed is considered to be particularly bad and confirm if this assessment is correct.)

@gasche

Ahead-of-time full-program monomorphization is known to be impossible for languages that support polymorphic recursion, including OCaml and Haskell (but not SML, see MLton), Agda, Idris, etc. If you have a JIT you can monomorphize "on the fly", this is what the dotnet runtime does. This point was made in this earlier comment

Are there any resources that show how a simple JIT compiler that can monomorphize code that contains polymorphic recursion looks like? I do believe Andreas Rossberg when he says it's not possible to do ahead-of-time full-program monomorphization because of polymorphic recursion. I would like to understand the argument in detail, why this is the case, where exactly things cannot work when trying to monomorphize ahead of time.

@gasche
Copy link

gasche commented Jul 10, 2020

I think #94 should be fleshed out as a summarized proposal; right now people are discussing it as a single proposal while taking some points in the middle of the discussion, it is confusing/difficult to keep track of what they mean.

@sabine: I think the simplest way to extend what-I-understand-as-#94 with tagged <immtype> <reftype> would not be to have three sized regions in each structref: one for immediates (size I), one for references (size R), and one for tagged values (size T). If having three regions is not easy from an implementation perspective, one could require that either R=0 or T=0: either a block uses only pure references, or only tagged references, but not both.

@aardappel
Copy link

@sabine @gasche I would say that if tagged potentially being a different size from anyref would cause 3 rather than 2 size fields, that would be enough to prefer the simplicity of having just i31, in the context of #94.

In the context of the MVP however, (with coercion, not subtyping), allowing several scalars types could be worth it, as the cost of ani63 in a tagged is not greater than generally storing an i64. Making this a coercion seems generally saner than subtyping, and it accomplishes the goal of making only producers that use these types pay for their cost (in engines that do not use tagging already).

In fact, this modularizes the i31ref feature, to the point where you could go ahead with a GC MVP proposal that does not contain it, and make it an add-on (which may be useful if we want to debate further tagging schemes).

It will have the downside that access to the JS world is defined entirely in terms of anyref, and a JS API allows us to store an "abitrary" value, then an OCaml producer, say, cannot simply pass its i31ref as-is, and will need to box. That is probably acceptable.

@RossTate
Copy link
Contributor

@RossTate @jakobkummerow you both seem to be assuming that tagged <immtype> <reftype> is a reference type -- that it needs to be a subtype of #94's primref/pointer or the MVP's anyref.

@sabine @gasche The arguments I gave are not about types, they're about values. The purpose of using smaller scalars like i31 and i63 is entirely to let scalar values occupy spaces primarily intended for reference values, whether those spaces be the non-linear fields of a structref or wherever anyref happens in the current MVP. The costs I gave have to do with the optimized representations for references that get pushed out by scalars despite likely being more useful for more languages (possibly including even OCaml).

@sabine
Copy link
Author

sabine commented Jul 13, 2020

@aardappel For OCaml, I think it is enough in #94 to have a runtime type for the heap block that specifies the type of values in the reference array (tagged with 64 bits, tagged with 32 bits, or primref). I agree that maintaining three size fields would be too much.

Explicit coercion between anyref and a tagged reference value seems the right thing to do. The cost model here is predictable, since tagging and untagging are ALU operations.

There's nothing wrong with having i31ref as subtype under anyref, if that helps JavaScript interop, or even just to provide something that can be used right now. We can use that as a workaround, to implement a 32-bit target for the OCaml compiler on WASM while we wait for tagged to implement a 64-bit target.

Though, tagged looks to me like the more correct thing to implement in the long run, precisely because languages that do not use tagged do not need to pay for it and because languages that do use it get the 63-bit tagged that they will use in their 64-bit targets.

@RossTate

The purpose of using smaller scalars like i31 and i63 is entirely to let scalar values occupy spaces primarily intended for reference values

Ah, the purpose you assume is more specific than what we actually need:

In the OCaml heap model, we let 31-bit and 63-bit scalar values occupy spaces that can be occupied by either scalars or references.

We do not need our scalars to fit in the same space as a generic reference value.

What we do need is a type that can store both scalars and references with reasonable efficiency. tagged as a type coercible to and from anyref looks like it fits the bill.

@gasche
Copy link

gasche commented Jul 13, 2020

@RossTate I still don't understand your argument. In a design where tagged <immtype> <reftype> is available, then anyref does not need to be include i31ref, and can remain represented using 32 bits. The only assumption that we need for tagged to make sense is that pointers are 2-aligned. (This could be made statically explicit with a type aligned 2 <reftype>, but we haven't been using this in the discussion.) Then if you use tagged i31 anyref, you get what is currently anyref (in the current MVP that includes i31ref), and this can fit in a 32bit word. If you use tagged i63 anyref, you get a new type that requires 64 bits of storage space. The fact that this type can be expressed does not change any property of anyref itself. Yes, if now you start to use tagged i63 anyref heavily in your programs, those program will consume more memory than if they used tagged i31 anyref. The choice is entirely left to the producer of the code.

The existence of i64 or f64 in wasm does not mean that anyref has to be 64bit-wide, and tagged i63 anyref is just the same.

In the current MVP proposal, anyref is the only type that can efficiently represent either references or scalars, so it is the only sensible choice for a "universal type" in a language that needs a uniform representation for both scalars and values. If tagged is available, then you get several sensible choices "universal types", with different sizes: tagged i31 anyref, tagged i63 anyref, but also more precise types like tagged i31 (ref (struct i32)) or `tagged i63 (ref (struct i32)) (in the case of OCaml where blocks are known to start with a block tag). Users (or in our case, compiler authors) can choose the type they prefer for their usage.

(In my intermediate proposal (ref i31 <reftype>), you can only tag with i31, but retain the ability to use either anyref or more precise types as the reference type.)

@RossTate
Copy link
Contributor

Y'all are assuming that anyref or primref do not have anything better to do with those lower 2 bits, and so there's implicitly free space making it possible for tagged i31 ... to have room for a reference and for a scalar. What I am saying is that there might be better uses for those lower 2 bits rather than to leave room for scalars. For example, primref might want to use the bits to distinguish at least funcref and structref values in order to permit fast casting. And anyref might want to use bits completely differently to support NaN boxing.

In other words, while tagged i31 primref might make boxing integers faster for OCaml (by not boxing them), it has the cost of making either casts to structref or casts to funcref slower for OCaml and for everyone else.

@aardappel
Copy link

@sabine

There's nothing wrong with having i31ref as subtype under anyref

It means we commit to checking that bit in any anyref we want to access as ref, forever, even in Wasm engines that don't also have a JS implementation, and could otherwise store/access anyref as a naked pointer.

@gasche
Copy link

gasche commented Jul 13, 2020

@RossTate we could think about having, for example, (tagged structref funcref). In other words, maybe the strategy of exposing tagging structures in types can also fit your favorite needs.

My impression is that (ref i31 <reftype>) and (tagged <type> <type>) are interesting "alternatives to i31ref" (they feel much more realistic to me than monomorphisizing everything), and they would each design a dedicated PR to expose a design and foster further discussion.

(Again, the primref PR should really write its design down to summarize the #94 discussion. This would also help with discussing (ref i31 <heaptype>) and (tagged <type> <type>) in the context of that lower-level proposal.)

@RossTate
Copy link
Contributor

we could think about having, for example, (tagged structref funcref). In other words, maybe the strategy of exposing tagging structures in types can also fit your favorite needs.

@gasche, you seem to be describing a heterogeneous system, meaning there is no global tagging structure. But then you need structref to say what kind of reference fields it has, e.g. (tagged structref funcref), but then the structrefs in those field presumably have the same tagging convention, leading to recursive types. See this proposal for how to carry that out.

(Again, the primref PR should really write its design down to summarize the #94 discussion. This would also help with discussing (ref i31 ) and (tagged ) in the context of that lower-level proposal.)

We're working on it, but it will not have a structref type that is parameterized by the type of its reference fields. One of the benefits of the proposal is its simplicity, and furthermore that complexity doesn't solve problems without introducing recursive types (see above).

@gasche
Copy link

gasche commented Jul 15, 2020

Out of curiosity, I manually boxed an OCaml program performing integer arithmetic, in order to evaluate the performance overhead of systematic integer boxing on the current runtime. (The function I used is what I call "sumtorial", like factorial but with a sum instead of a product, basically a complex way to complex n*(n+1)/2.)

let sumtorial n =
  let rec loop acc = function
    | 0 -> acc
    | n -> loop (acc + n) (n - 1)
  in loop 0 n

module BoxedInt = struct
  type t = Int of int
  let (+/) (Int a) (Int b) = Int (Stdlib.(+) a b)
  let (-/) (Int a) (Int b) = Int (Stdlib.(-) a b)
end

let boxed_sumtorial n =
  let open BoxedInt in
  let rec loop acc = function
    | Int 0 -> acc
    | n -> loop (acc +/ n) (n -/ Int 1)
  in loop (Int 0) n

let locally_unboxed_sumtorial n =
  let open BoxedInt in
  let rec loop acc = function
  | Int 0 -> acc
  | Int n ->
    let (Int a) = acc in
    loop (Int (a + n)) (Int (n - 1))
  in loop (Int 0) n

boxed_sumtorial is an exact translation of the sumtorial definition, with boxed integers instead of unboxed integers. locally_unboxed_sumtorial is a version where I manually inlined the arithmetic constants and operations and applied local boxing/unboxing simplification.

On my machine, the boxed version is precisely three times slower than the unboxed version. The locally_unboxed_sumtorial is not much faster than boxed (3.6s rather than 3.9s on a test), because for this program it only eliminates redundant unboxing, there is no redundant boxing that can be eliminated locally. (Of course we could change the calling convention of the function to take unboxed integers, but this is not a local transformation anymore.)

@jakobkummerow
Copy link
Contributor

Quick update: as of just now, i31ref as described by the current proposal is fully implemented in V8's prototype, so you can use that for any experimental performance investigations/comparisons.

In case anyone has a demo where they suspect that significant (or at least measurable) time is being spent on taggedness-checks even though the demo doesn't use i31ref, it would be simple to create a custom build (or introduce a runtime flag) to turn off i31ref support and verify this suspicion. I won't do that until/unless someone asks me to though :-)

(Disclaimer: this is not a statement of opinion on whether i31ref should exist in the spec, or in what form. I find the flexibility of a generalized form like (tagged i31 anyref) appealing; I also like the simplicity of the current proposal (and believe that it doesn't incur an unreasonable cost at runtime); and I'd also be fine with not having it at all -- we can always add it later if needed.)

@sabine
Copy link
Author

sabine commented Jul 17, 2020

For example, primref might want to use the bits to distinguish at least funcref and structref values in order to permit fast casting. And anyref might want to use bits completely differently to support NaN boxing.

In other words, while tagged i31 primref might make boxing integers faster for OCaml (by not boxing them), it has the cost of making either casts to structref or casts to funcref slower for OCaml and for everyone else.

Ok, if I get this right, you say that having tagged i31 anyref in the spec prevents the engine from using the lower (unused) bits of a heap pointer.

I see two cases here:

  1. In case an aligned heap pointer is stored by means of the type tagged i31 anyref, this is true. 32 bits are used, there is no way for the engine to embed additional information about the heap pointer in these 32 bits at runtime.
  2. In case an aligned heap pointer is stored by means of the type anyref, the engine is still free to use lower (unused) bits in any way it sees fit.

Since we cannot type cast between tagged i31 anyref and anyref, at any point in program execution, the engine knows exactly (thanks to type annotation of the WASM program) which of these representations it is dealing with.

It looks to me like there is no difference for the GC walking the heap, when adding tagged to the current MVP proposal. It already needs to know the type of the heap struct it is walking, in order to know which fields may be pointers.

So, there are three entities here:

  • aligned heap pointer
  • value of type anyref
  • value of type tagged i31 anyref

Is there anything that prevents the engine from assuming different semantics for the implementation of

  1. aligned heap pointer represented as tagged i31 anyref (cannot store information in lower bits), and
  2. aligned heap pointer represented as anyref (can store information in lower bits)?

It looks to me like adding tagged instead of i31ref can only potentially bring a disadvantage to the users of tagged (who are likely to be willing to incur that because it enables reusing their existing compilers), but not to those who only use anyref and no tagged representations. In contrast, i31ref comes at a cost to everyone, but is less complex.

@sabine
Copy link
Author

sabine commented Jul 17, 2020

@gasche

we could think about having, for example, (tagged structref funcref)

This tagged-idea could in theory even be taken further:

tagged (ref $A) (ref $B) (ref $C) i30

tagged (ref $A) (ref $B) (ref $C) (ref $D) (ref $E) (ref $F) (ref $G) i61

@RossTate
Copy link
Contributor

So what you're describing is a heterogenous approach. Unfortunately, the current MVP is designed around a homogenous approach. For example, its compilation model for imported/exported types is designed around a universal representation. As such, you would not be able to use tagged i31 anyref as imported or exported types (without effectively changing anyref to bake in i31ref).

But if you want to go with a heterogeneous approach to enable custom low-level representations, then it's better to take things further than just tagged. For example, right now the coercion from tagged i31 anyref to anyref (with specialized low bits) would require a memory read to determine the low-bit information that was omitted from tagged i31 anyref to make room for the unboxed scalar. To prevent these kinds of efficiencies, you want a coordinated tagging scheme for your types. That's what the SOIL Initiative's proposal does.

For completeness, I should note that tagged i31 anyref can still have a cost even with your clarifications. Note that the coercion from tagged i31 anyref to anyref required a way to look at the heap contents to determine the low bits. That assumes the heap has that information. But some garbage-collection implementations don't put meta-information in the heap at all, at least for common small objects, relying on the fact that the meta-information is always available in the low-bits of the pointer. For example, if we were to add i32ref and i64ref, instead of a wasting a word in the heap for each of these objects just to say that they are i32ref and i64ref respectively, an engine could make sure that the low bits of every pointer to these objects tracks whether they are i32ref or i64ref (or other). But it can't do that if tagged i31 anyref exists, because there's not enough room in the low bits to track that information, forcing these values to take more information in the heap. (Since a funcref is a pair of a code pointer and a module-instance pointer, it too might be treated as a common class of small objects.)

@sabine
Copy link
Author

sabine commented Jul 20, 2020

the current MVP is designed around a homogenous approach. For example, its compilation model for imported/exported types is designed around a universal representation. As such, you would not be able to use tagged i31 anyref as imported or exported types

Thanks for bringing up the type imports spec, I hadn't read that in all details yet. 👍

Okay, the type imports proposal says "As far as a Wasm module is concerned, imported types are abstract. Due to Wasm's staged compilation/instantiation model, an imported type's definition is not known at compile time." (https://github.com/WebAssembly/proposal-type-imports/blob/c9700ff6267571f4a52151c8a46e800f8534f923/proposals/type-imports/Overview.md)

One paragraph later it says "However, an import may specify a subtype constraint by giving a supertype bound with the import"

If I understand this correctly, this means that all type imports/exports must be subtypes of the type any ("the type of all importable (and referenceable) types"). So, we generally cannot import/export any value types, only reference types.

Okay, now looking at this from a practical perspective for OCaml: why would we need to import/export types for tagged i31 anyref in the first place?

When importing something like a global, a function, or a table, I don't need to have a nominal type that wraps tagged i31 anyref, I can use tagged i31 anyref as a type directly, just like i32 and the other value types, right or wrong?

I could still import/export struct or array types that contain tagged i31 anyref, right or wrong?

But some garbage-collection implementations don't put meta-information in the heap at all, at least for common small objects, relying on the fact that the meta-information is always available in the low-bits of the pointer.

You mean, some garbage-collection implementations don't put meta-information in the bit-representation of the small object, but instead they put meta-information in the bit-representation of the pointer pointing to the small object?

we were to add i32ref and i64ref, instead of a wasting a word in the heap for each of these objects just to say that they are i32ref and i64ref respectively, an engine could make sure that the low bits of every pointer to these objects tracks whether they are i32ref or i64ref (or other). But it can't do that if tagged i31 anyref exists,

How does that work with arrays of i32ref and i64ref, or structs that contain them? I see that, for an individual ref i32ref or a ref i64ref, an engine could store in the pointer whether there is a 32-/64-bit scalar or a pointer stored in the pointed-to heap location, but for arrays or struct fields, the engine must store that information somewhere else?

To prevent these kinds of efficiencies, you want a coordinated tagging scheme for your types. That's what the SOIL Initiative's proposal does.

I understand that as "The SOIL initiative proposal aims to give the producer a way to influence the tagging scheme the engine uses." To achieve that, unavoidably, there is some added amount of complexity that the current GC MVP does not have (in particular, it adds complexity to the engine implementation while producers get to pick what they need from the spec). Is there potential for simplifying the SOIL initiative proposal, or do you think it cannot get simpler than it is?

Can you give feedback on the scheme I sketched in #100 (comment)? I was confused about some things, in particular, how to represent arrays. Is cases meant as a hint for the engine to use a tagging scheme to represent the different cases?

@titzer
Copy link
Contributor

titzer commented Aug 6, 2020

@RossTate I believe that forcing imported/exported types to be ref types (by requiring they are deftypes) is a stopgap to avoid the whole polymorphism issue. I believe strongly that we should not preclude true parametric polymorphism in that proposal but should design for the most general case of importing types of unknown representation. That is necessitated by embeddings that need to reference and manipulate values of host types that do not fit into Wasm's type system at all.

What about making i31ref a type constructor instead of a type? We already have ref T and nullref T, why not i31ref T?

@RossTate
Copy link
Contributor

Oops, so sorry @sabine! I let this get buried.

Okay, now looking at this from a practical perspective for OCaml: why would we need to import/export types for tagged i31 anyref in the first place?

Well ideally a WebAssembly built with OCaml would be able to export its values for others to use. If, say, you built an efficient hashmap (as a toy example), then you'd like to export your hashmap type abstractly in WebAssembly (just like you would in OCaml) along with WebAssembly functions for operating on your hashmaps (just like you would in OCaml).

You mean, some garbage-collection implementations don't put meta-information in the bit-representation of the small object, but instead they put meta-information in the bit-representation of the pointer pointing to the small object?

Yep.

How does that work with arrays of i32ref and i64ref, or structs that contain them?

There's a wide variety of techniques, and which one a GC can use depends on the invariants it can maintain. Some techniques are to always use the same tagging convention throughout the system (this is what OCaml does). Another technique is to have a descriptor at the top of any struct/array informing the GC how to work with that structure. That descriptor can be encoded as bits in a variety of ways, or point to some meta information (or even a code address to jump to).

Is there potential for simplifying the SOIL initiative proposal, or do you think it cannot get simpler than it is?

Yep. We've been waiting for there to be a proper discussion and round of feedback to determine which simplifications we should explore.

Can you give feedback on the scheme I sketched in #100 (comment)?

Ah! Even more sorry! You went through all that work and I somehow missed it entirely. I'm happy to give some feedback. Also, note that we did a case study on how to do typed functional languages here. It's handling of closures is a little outdated though, since the call-tags proposal now provides a better way to deal with the complications caused by currying.

value :=      scheme.new ((field $tag (unsigned 8) immutable) castable

Instead of using $tag here, I would recommend having value declare (extensible (cases ...)) with all the various cases of reference values you have (which should each be marked as castable). The proposal will then take care of encoding these as a tag for you, whether in the pointer or in the metadata on the heap. You'll still want $tag in tuple<n> though, for distinguishing between cases of an ADT.

I was confused about some things, in particular, how to represent arrays.

For arrays, you use indexed fields. For example,

double-array block: scheme.new (parent implicit $value)
                         ((field $tag (unsigned 8) immutable)
                         (field length $array_length (unsigned 32)))
                         (field (indexed $array_length) (gcref $value))

Besides that (and insignificant syntactic things), your sketch looks good to me.

@sabine
Copy link
Author

sabine commented Aug 21, 2020

@RossTate Thanks for the feedback! It does look like things are representable in the proposal.

Instead of RTTs, there are schemes, and it is possible to test whether a gcref belongs to a certain scheme, similar to being able to test whether a reference has a certain RTT.

Well ideally a WebAssembly built with OCaml would be able to export its values for others to use. If, say, you built an efficient hashmap (as a toy example), then you'd like to export your hashmap type abstractly in WebAssembly (just like you would in OCaml) along with WebAssembly functions for operating on your hashmaps (just like you would in OCaml).

It seems nice, from a language user's (someone who writes code in a language that compiles to WebAssembly) perspective, to have the ability to create and link modules that expose opaque values and functions that operate on them.

I don't know, if the ability to express and handle these opaque types" must or should be provided on the WebAssembly level. It looks to me like it is still an open question whether WebAssembly should provide all this infrastructure, or whether languages should invent their own abstraction on top of WebAssembly for this.

The only place where I so far know about built-in opaque values being strictly necessary is in the case of WebAssembly system interfaces, where, for security reasons, handles to operating system resources must not be modifiable by a WebAssembly program.

We've been waiting for there to be a proper discussion and round of feedback to determine which simplifications we should explore.

I would guess that, if one could get many different languages to represent their heap in that model, it should be possible to quantify which of the features are not used (or so rarely used that it doesn't make sense to implement them in a MVP).

If there was an interpreter implementation of the SOIL initiative model, that would enable some experimentation. However, that's a fairly large effort, even though the performance of the interpreter is not important at all.

Another way to look at the model is to look at every piece and try to answer the question "what would be
the effect on performance if this is removed".

rossberg pushed a commit that referenced this issue Feb 24, 2021
It's only a trap if a byte is accessed out-of-bounds, not if the
destination or source index is out-of-bounds. This only makes a
difference when the length is zero, in which case the previous behavior
would trap, and the new behavior will be a no-op.
@tlively
Copy link
Member

tlively commented Nov 1, 2022

@eqrion, you may find the opening post of this issue interesting, especially the part where @sabine says that compiling OCaml to linear memory would be preferable to having to box everything if i31 were not available.

Apart from that, most of this discussion is outdated or could be more productive in new issues now that we have folks actually working on compiling OCaml to WasmGC, so I'll close this issue.

@tlively tlively closed this as completed Nov 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests