
GC story for strings? #145

Closed
dcodeIO opened this issue Sep 30, 2020 · 90 comments

Comments

@dcodeIO

dcodeIO commented Sep 30, 2020

As of the MVP document, strings can be expressed as either an (array i8) or an (array i16), per a language's string encoding, but with only one character at a time being accessible via array.get and no compatibility with JS strings, I'd imagine that the MVP will not be very useful beyond pure experimentation if a language wants to target GC primarily. Think comparing, hashing, substrings, etc.

The Post-MVP document, for instance, mentions bulk operations, in particular array.copy, which is a start but won't help much with strings, and to my surprise doesn't mention strings at all so far.

Hence I have questions:

  • Are there any plans yet to address strings in particular?
  • Is it reasonable to expect that Wasm strings and JS strings will be API-compatible eventually, so Web APIs can be called efficiently from Wasm and vice-versa?
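To make the first question concrete: with only array.get available, lifting a Wasm GC string into JS means one call per code unit. A sketch of that worst case, where `len` and `get` are hypothetical stand-ins for host-side accessors over a GC array (nothing like them exists as a real API):

```typescript
// Sketch of what char-at-a-time lifting costs. `len` and `get` stand in
// for hypothetical host-side accessors over a Wasm GC (array i16);
// they are illustrative, not a real JS API for GC arrays.
function liftString(
  arr: any,
  len: (a: any) => number,
  get: (a: any, i: number) => number
): string {
  let out = "";
  for (let i = 0; i < len(arr); i++) {
    // One call (one boundary crossing) per UTF-16 code unit.
    out += String.fromCharCode(get(arr, i));
  }
  return out;
}
```

With a plain array standing in for the GC array, `liftString([104, 105], a => a.length, (a, i) => a[i])` yields `"hi"`; comparing, hashing, and substring extraction all pay this per-element cost.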
@KronicDeth

Interface Types are supposed to handle the conversion from one string encoding to another

@dcodeIO
Author

dcodeIO commented Sep 30, 2020

Yeah, this is somewhat a continuation of my infamous thread on UTF-16 over at interface types. Give it a read if you haven't yet, it's great. There is indeed some overlap between the two, and my impression is that interface types will be useful in the absence of GC, but we'll still need a better story in the presence of GC, most likely integrating with the parts of interface types that apply to both.

To give an example: a language may choose not to rely on linear memory at all but use GC objects exclusively, including to represent strings, and there may be the expectation that these strings can be passed to or received from Web APIs, and that dealing with such strings internally to a module is reasonably efficient. Neither interface types nor GC, in their current forms, is very useful there, unfortunately.

@RossTate
Contributor

The GC proposal is not responsible for this kind of interlanguage interop. Different languages represent the same high-level concepts differently in their low-level representations for a variety of reasons. For example, OO languages need strings to have a v-table with references to an i-table and to reflection information, and each of those would need to be organized according to the specific OO language's APIs. Some languages/libraries represent strings as trees. Then of course there are the issues with encodings. Given the diversity, it seems difficult for there to be a concept like "wasm string" that different systems could reasonably use for interop without copying or wrapping. This is why the Requirements document explicitly lists this kind of interop as a non-requirement.

@dcodeIO
Author

dcodeIO commented Sep 30, 2020

The requirements document also states the opposite in its critical success factors, and I am very surprised to see this argument being made. If we don't have a good story for strings in Wasm GC, and given that strings are by far the most common higher-level data structure in most programming languages (and very commonly consumed and produced by Web APIs), I'd go as far as to say that we are about to make a critical mistake endangering the success of Wasm as part of the Web ecosystem. Not tackling this will significantly undermine the viability of AssemblyScript in favor of systems languages compiling to Wasm, for example, and as such some might see this as an inexcusable standards-body decision. That's exactly what I was concerned about in my comments on the requirements document.

@RossTate
Contributor

There is a middle ground. WebAssembly (and specifically GC) can provide the infrastructure for interop without baking specific interop designs into its core design. So someone can design a specific way to represent strings using GC/wasm types, and then various languages can choose to use that particular representation to facilitate interop. This includes JS, though you'll have to get the browsers to agree to do so. Or someone can design a module interface (with an abstract exported type) for string values and operations that wasm modules can link to and which each browser can implement according to how it implements JS strings. (For this to be efficient, you'll want a better compilation model though, which would be useful for a variety of reasons.)

The high-level idea is that interlanguage interop is an ecosystem problem, not a low-level problem, and it is best served in WebAssembly by developing low-level mechanisms that can support various ecosystem solutions.

@dcodeIO
Author

dcodeIO commented Sep 30, 2020

To me the middle ground is to ensure good interoperability with JavaScript as a critical success factor, not necessarily between arbitrary languages, and I was under the impression that this is what browser makers and the overwhelming majority of the group want as well. The arguments and suggestions you are bringing up might be worth exploring, of course, who knows what we'll get out of it, but so far I don't see a clear merit in making this any more complicated than it needs to be.

@fgmccabe

Concur with Ross here. The kind of interoperability you seem to be asking for is already impossible with JS. Most engines, including V8, use a multitude of internal representations for JS strings. The differences are 'papered over' at the source level.

@dcodeIO
Author

dcodeIO commented Sep 30, 2020

I am completely stumped by the responses and will have to think about it for a bit. Might well be that betting on Wasm GC and trusting you with the changes made to the requirements document instead of following up on my points was a mistake.

@RossTate
Contributor

@dcodeIO It would be helpful if you would more concretely illustrate the use cases you are concerned about. You mention wanting something like array.copy for strings and wanting to interop with JS strings, but these two desires seem to be contradictory given that JS strings are not represented in real engines as arrays. (Here is a summary I found of V8 string representation.) I feel like your needs might be met by having a host-implemented string library (as I suggested above) with a function for constructing a new string from the contents of a given mutable array (it has to be copied because JS strings must be immutable), and possibly another for constructing a new string from an immutable array. But without concrete illustrations of those needs, I can only guess.
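As a rough illustration of the host-implemented library idea (all names and signatures here are invented for the sketch, not proposed API), the JS side of such a library could look like:

```typescript
// Illustrative only: a host-side string library a wasm module could
// import. JS strings are immutable, so constructing one from a
// (mutable) array necessarily copies its contents.
const hostStrings = {
  // Copy the array's contents into a fresh immutable JS string.
  fromCharArray(codes: ArrayLike<number>): string {
    let s = "";
    for (let i = 0; i < codes.length; i++) s += String.fromCharCode(codes[i]);
    return s;
  },
  length(s: string): number { return s.length; },
  charCodeAt(s: string, i: number): number { return s.charCodeAt(i); },
  eq(a: string, b: string): boolean { return a === b; },
};
```

A variant taking an immutable array could, in principle, skip the copy, which is the distinction drawn above.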

@tlively
Member

tlively commented Sep 30, 2020

There seem to be two needs at play here: 1) String manipulation in the GC proposal needs to be competitive with string manipulation in linear memory and 2) JS interop is generally important, and interop with JS strings is particularly important.

On the first point, I agree that the current MVP seems to leave a lot on the table since it can only access one array element at a time. It seems that the best solution would be to have instructions akin to the variety of MVP load and store instructions that operate on i8 arrays (or arbitrary integer/number arrays?) rather than linear memories.

On the second point, the complexity and diversity of string representations in JS engines make it infeasible and undesirable to expose the underlying structure directly to WebAssembly, as @RossTate pointed out. That leaves two possible paths for first-class JS string interop: A) a "wasm string" type is introduced with instructions corresponding to all the common string operations that can be an abstraction directly on top of an engine's native JS string representation, or B) the JS API defines a coercion from JS strings to GC types and vice versa to make passing strings into and out of WebAssembly simpler.

Both of these solutions are problematic. (A) does not fit with the goal of having the GC proposal represent only low-level representation types, and as Ross mentioned, most languages would not be able to use a native "wasm string" as their string type anyway. There's also practically very little difference between (A) and just importing a string type and functions on it from the environment, which will be possible with type imports. (B) is problematic because there is no good way to choose which GC representation of strings to bless. Choosing any particular "blessed" representation would inherently favor some source languages over others, so it would be better to produce a more general solution like interface types that can work for all languages.

@DLehenbauer

most languages would not be able to use a native "wasm string" as their string type anyway

Actually, I think a first-class immutable string primitive that "papers over" the underlying byte encoding could still help a lot, even for languages that have a different native string representation.

For example, while C# might not be able to adopt "wasm string" as the underlying representation of System.String, C# could expose this new type as 'System.Interop.WebAssembly.WasmString' and let C# developers choose if they want to use the new zero-copy WasmString or pay the cost of transcoding to get a classic C# string.

I think option (A) is worth pursuing further.

@kripken
Member

kripken commented Sep 30, 2020

One specific point: We are discussing strings here on the GC proposal, but the issues are separable. It's very possible strings do not fit in with the goals of the GC proposal (as @tlively just mentioned), but also that @dcodeIO 's requests are valid and should be addressed by adding a string type to wasm in another proposal.

My perspective is that we should add a string type to wasm, perhaps doing that in a separate proposal all by itself. The benefit of a string type would be similar to string types in other multi-language VMs (CLR, JVM, and for that matter JS) - they allow quick and easy passing of string data between languages. C won't use it (just like it won't on the CLR), but I understand @dcodeIO to be asking for AssemblyScript to be able to interop with JavaScript the way that TypeScript and many other languages that compile to JS can. That's a good goal!

@RossTate
Contributor

Great points above. And I'm realizing (thanks to @tlively's post) that I overlooked a concern raised in the OP about arrays. Sorry @dcodeIO!

On arrays, one thought that's come up a few times is having linear memories within GC objects. This came up in at least #94 / #109, and in WebAssembly/interface-types#68 there is some discussion of slices, where I think it would be nice from an interop perspective for a slice to be either a view into a linear memory or a view into non-referential data within a GC object.

@Horcrux7

I like the idea of a separate proposal for strings. In Java and similar languages you could hold the "native" string inside the language's string object, much like the native array inside an array object.

Currently a string from JavaScript can only be held as an externref. I cannot check equality, etc. It is like a black box.

And with Interface Types a string must be converted on every call. That will never be efficient.

@lukewagner
Member

I expect even a separate "add a string" proposal is fraught with peril because there are so many fundamentally-incompatible low-level design choices with strings that will ultimately force the proposal to bias towards one subset of languages at the expense of others.

One alternative route for JS that I've been imagining is:

  • the wasm module preimports a type that it uses as the host's string type, as well as function preimports for each string operation which use the imported string type in their signatures
  • the wasm module gets preinstantiated with some WebAssembly.StringType singleton (defined by the JS API to mean "the JS string type") and JS built-ins for each string operation (adding new ones for operations that don't already exist)
  • because preimports are used, the wasm engine is encouraged to statically specialize the machine code to the specific imported values, achieving essentially the same performance as core wasm types/instructions (to a knowing engine)

The advantage of this approach is that, when targeting a Web embedding, a wasm module can use operations that precisely match JS strings while a module targeting, say, a future Go embedding could import a completely different set of Go-string-appropriate operations.

Ultimately, if your aim is to write a single module that can be used across many different hosts, I think you're going to end up with an impedance mismatch unless you make a copy at the wasm/host boundary into and out of the wasm module's own private string representation. But if you're targeting only a single host and the source language is intentionally embracing the host's string type, then the preimport approach seems pretty good.
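The shape of such a preimport interface might look as follows (every name here is illustrative; nothing like this is specced). The point is that the same signatures can be satisfied by native JS strings on the Web or by a completely different host type elsewhere:

```typescript
// Illustrative shape of a preimported string type plus operations.
// A web embedding instantiates Str = string; another host could
// substitute its own representation behind the same signatures.
interface StringOps<Str> {
  length(s: Str): number;
  charCodeAt(s: Str, i: number): number;
  concat(a: Str, b: Str): Str;
  eq(a: Str, b: Str): boolean;
}

// The JS embedding's implementation is trivial: Str = string.
const jsStringOps: StringOps<string> = {
  length: (s) => s.length,
  charCodeAt: (s, i) => s.charCodeAt(i),
  concat: (a, b) => a + b,
  eq: (a, b) => a === b,
};
```

Because the import is resolved at compile time, a knowing engine could inline these operations rather than dispatch through actual function calls.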

@kripken
Member

kripken commented Sep 30, 2020

@lukewagner

I expect even a separate "add a string" proposal is fraught with peril because there are so many fundamentally-incompatible low-level design choices with strings that will ultimately force the proposal to bias towards one subset of languages at the expense of others.

I think picking a fixed string type is inevitable, and yes that will bias to a subset of languages, but I disagree with the assumption that that's a downside!

JS, the JVM, and the CLR all have a single string type, and all show how successful that can be in letting various languages pass data back and forth easily and efficiently. If we think there's a better example, or that there's something wrong with their approach, I'd be curious to hear details about that. Otherwise I think we should learn from their success - as those VM ecosystems show, some languages will choose to use the blessed string because it's compact, convenient, and efficient, that works with a lot of other code. That's good for the ecosystem, I think.

Picking one string type doesn't prevent other languages from not using it - there is no downside to them. And we'll still want Interface Types for the case where a language wants to do strings its way and convert at the boundary. Also I agree the preimport idea is interesting too, and can help with perf issues.

@kmiller68

JS, the JVM, and the CLR all have a single string type, and all show how successful that can be in letting various languages pass data back and forth easily and efficiently. If we think there's a better example, or that there's something wrong with their approach, I'd be curious to hear details about that. Otherwise I think we should learn from their success - as those VM ecosystems show, some languages will choose to use the blessed string because it's compact, convenient, and efficient, that works with a lot of other code. That's good for the ecosystem, I think.

One obvious downside to fixing strings to something like JS strings is that they effectively have to be encoded as UTF-16 (at least for non-ascii strings) since that's how JS indexes strings. There's no efficient way to hide that from the user without somewhat subtle perf cliffs (e.g. re-encoding the string in a different format).
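The indexing constraint is easy to see from JS itself: any O(1)-indexable string API observably commits to UTF-16 code units for non-BMP text.

```typescript
// JS indexes strings by UTF-16 code unit, so non-BMP characters
// occupy two indexable positions.
const s = "a\u{1F600}b";              // "a😀b"
const units = s.length;               // code units, not characters: 4
const half = s.charCodeAt(1);         // 0xD83D, a surrogate half
const points = [...s].length;         // iteration goes by code point: 3
```

Hiding this behind a code-point (or UTF-8) index instead would require either O(N) indexing or re-encoding, which is the perf cliff mentioned above.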

@tlively
Member

tlively commented Sep 30, 2020

I think picking a fixed string type is inevitable, and yes that will bias to a subset of languages, but I disagree with the assumption that that's a downside!

Biasing toward a subset of source languages is a clear downside because it directly contradicts the design goals we publish in our specification. I agree that we should learn from the experiences of the JVM, the CLR, and JS, but only when those lessons are compatible with WebAssembly's firmly established design goals.

@kripken
Member

kripken commented Sep 30, 2020

@kmiller68

I agree that would be a downside there. A decision here would need to consider lots of factors (most of which I personally don't know enough about atm), including that.

@tlively

In general not every new feature will help every language or not help them equally. Things like Tables, exceptions, SIMD, GC, etc. may not end up used by every language, either because they choose not to or they can't - C can't use exceptions, for example, but it isn't "harmed" by us adding exceptions. Some new features may "bias" towards the languages they make sense for, but that's ok, because wasm as a whole still aims to support all languages.

To be clear, I'm not saying "let's do a string type and stop thinking about languages that won't use it!" - we'll still want Interface Types and other things in this area, as I already said.

@RossTate
Contributor

RossTate commented Oct 1, 2020

I dug into the CLI spec and found that string is treated essentially as an abstract type (i.e. a (pre)imported type) except for one instruction: ldstr $index specifies an index into the program's string metadata and returns the corresponding string value.

@Horcrux7

Horcrux7 commented Oct 1, 2020

@lukewagner How would a preimport change the behavior of the string type in WebAssembly? It can give some performance improvements, but it would still be of type externref. Then using ref.eq or any type check would not be possible.

@RossTate
Contributor

RossTate commented Oct 1, 2020

You can preimport the (abstract) string type, and that preimport can express requirements on that type such as "must support equality".

@Horcrux7

Horcrux7 commented Oct 1, 2020

I have no idea how this would be possible, even looking at the Type Imports and Exports proposal. But if it is possible, then such a type would work like an internal string type.

@RossTate
Contributor

RossTate commented Oct 1, 2020

Yes, there are various discussions about the shortcomings of that proposal. The "pre"-imports @lukewagner and I mentioned above provide compile-time imports, enabling the importing code to be specialized to the imported types and values/functions. One application of preimports we hope for is things like this, where we can enable the ecosystem to develop a string type that performs roughly as well as if it were a baked-in string type, without having to add the type and operations to the core spec of WebAssembly itself.

@kripken
Member

kripken commented Oct 1, 2020

The more I think about the pre-import idea, the more I like it as a solution here. Would it allow something like the following (if not, maybe I'm misunderstanding something)?

  • wasm modules import an abstract string type that is known to support "get char at index", "get length", and other basic operations.
  • A different pre-imported string type could lead to different behavior on the same wasm file (for example, "what is a valid character" differs between different string types).
  • On the Web, the natural thing to pre-import is the JS string type. Correspondingly, for wasm on the JVM and CLR, the natural thing would be their builtin string type. In all these cases the platform could provide the type to pre-import (so it's simple for modules to use, and not downloaded each time, etc.).
  • Pre-importing the natural string type on the Web, JVM, and CLR could lead to different behavior. Or, if that is not what a specific project wants, it could implement a specific string type in wasm and pre-import that, in which case the behavior would be identical everywhere.
  • In all the above cases, string operations are compiled to be fast.

Does that make sense @lukewagner @RossTate ?

And @dcodeIO , if the above is correct, would it give AssemblyScript all it needs?

@bvibber

bvibber commented Oct 1, 2020

I'm not sure I fully grok the type pre-imports. If I understand correctly, a wasm module wanting to use JS strings would need to:

  • import the JS "string" primitive type (How do you refer to it on the JS end? Note there is no prototype for strings because they are not objects; String.prototype is for a wrapper class, not the primitive type)
  • import native functions for string operations (Would these be the methods from String.prototype? If so, how would you call them with no way to pass a this parameter from WebAssembly's call opcodes?)

It's unclear to me how to refer to the string type from JS, and how to call native string operations without wrapping them in JavaScript functions that shuffle an argument into this.
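For what it's worth, the this shuffle can be done once per operation rather than once per call by uncurrying the prototype method, though some glue (in JS or in the engine) still has to set this up:

```typescript
// Uncurry String.prototype.charCodeAt into a free function of shape
// (string, index) => number, which is the shape a wasm import needs:
// the receiver becomes an explicit first argument.
const charCodeAt: (s: string, i: number) => number =
  Function.prototype.call.bind(String.prototype.charCodeAt);
```

For example, `charCodeAt("abc", 1)` returns 98 with no per-call wrapper closure, but this bound shim is exactly the kind of JS-side adapter the question is about avoiding.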

@sunfishcode
Member

The need to support "get char at index" is a big part of the concern with any proposal in this space. If "char" in this case is assumed to mean "UTF-16 code unit", and "get char at index" is expected to do O(1) random access, it doesn't leave much room for implementations to have different kinds of strings in practice, regardless of whether the string type is abstract, imported, pre-imported, or anything else.

@dcodeIO
Author

dcodeIO commented Oct 1, 2020

First of all, thanks for all the thorough comments pointing out further aspects and seeking a solution to the problem. Appreciate it!

The type import idea is interesting in a way, in that it formalizes what one would do today (just less efficiently) by importing wrappers around String; taking it one step further leads to pre-imports, and taking it two steps further again summons that ominous universal string type on the horizon that users will eventually ask for. For instance, importing an environment-specific string is in fact one solution to the problem, but compiler authors will eventually crave a more universal way (with browsers being the common denominator) so they don't have to maintain various standard libraries, and users will become tired of duplicating and slightly modifying their entire code base to account for different environments.

Now I know about the challenges involved in creating something universal, yet my view on this has remained surprisingly consistent since the interface types UTF-16 discussion, in that I still reckon that the most reasonable thing to do is to bless the family of pretty much ubiquitous UTF encodings, abstracted as a universal Wasm string type, reusing the infrastructure browsers already have for expressing strings in various representations. Or, of course, something slightly more complicated ultimately achieving the same.

P.S: On a more general note: might be that I am in a somewhat special spot due to AS's proximity to JS-land, seeing so many Web(Assembly|)Devs who initially were super excited about Wasm being left behind disappointed. Sure, there is still a constant stream of new people entering Wasm-land, but that won't continue forever and reality will hit us hard. Certainly, i31ref won't save us, but good interop can. It makes me sad to see that we are failing over something that really shouldn't be anything a bunch of the smartest theorists and programmers can't solve. Feeling tempted to close this paragraph with a famous quote by Brendan Eich, but won't, because it may be invalidated if we only do what the quote says so we can eventually add a conditional or even replace a word.

@aardappel

A native string type (and even a host provided string module) sound like a really bad idea to me.

It risks this type being used pervasively, which brings more code inside the Wasm module(s) into contact with it, which means more code locations needing to deal directly with conversions, which brings inefficiency, code bloat, and, most importantly (since these unknown string representations can generate arbitrary errors), more code that needs to be error-aware and potentially has to deal with errors it wasn't tested with.

The advantage of arrays of i8 or i16 is exactly that they're binary: they can't error while in that format. In fact, they may contain actual non-text binary data, which is often useful. For an unknown string representation you have no guarantee what "bits" it is going to preserve. This will bring non-determinism into Wasm.

When it comes to strings, it's important to realize that the bulk of code just stores them or passes them on, and only a small amount of code needs to care about the actual text contained in it, and understanding unicode details. You thus want to optimize for a code structure where the bulk of code does NOT need to deal with encoding errors of implementation formats. A built-in implementation dependent string will do the opposite.

Wasm GC should not be an extension of JS, it is a low-level building block for a wide variety of languages that can choose their own representations. Let's not burden them with JS implementation details.

@kripken
Member

kripken commented Oct 2, 2020

@Brion

It's unclear to me how to refer to the string type from JS

Yes, not clear to me either. But I assume this would be something like providing the pre-import in JS from WebAssembly.JSString, that is, the wasm JS API would make it easy to do this.

@sunfishcode

If "char" in this case is assumed to mean "UTF-16 code unit", and "get char at index" is expected to do O(1) random access,

My hope (which @lukewagner can correct) is that that wouldn't be a problem with the pre-import approach. The imported type would have a "get char" but it would be abstract and not make any claims as to speed of access. One could provide a pre-import that is O(1) or one that has O(N) or anything else (each resulting in a different program, effectively).

@dcodeIO

compiler authors will eventually crave for a more universal way (with browsers being the common denominator)

I suspect you are right, there is a vacuum here that will want to be filled with a string type. But if pre-imports can work the way I outlined, I think it can work out without speccing such a type. On the web we'll pre-import JS strings, and if someone wants to run their wasm off the web too, and have it run the exact same way, they could link and pre-import a wasm module that implements something equivalent to the JS string type (that's why I asked about this before). It's possible that that will lead - in an organic way, without any spec work - to a JS string-like type being popular in the whole wasm ecosystem. Or maybe wasm on the server (say in the CLR or JVM) will prefer another type, who knows.

@bvibber

bvibber commented Oct 10, 2020

@dcodeIO while thinking on that I found myself questioning how important the actual string type is, and am now wondering whether a useful way of thinking is to model strings as immutable arrays of u8 or u16, which can be bridged to host strings via the Interface Types conversions.

Arrays already provide the indexed element load instruction needed to iterate or do random access, and the most important operations to optimize are bulk-memory ops like concat & slice that are common with strings.

At the Interface Types layer, raising an (array immutable u16) to a string could copy it to a JS string, but it could also be fully optimized by backing the JS string with the array's existing buffer. Likewise in the inverse direction if the JS string uses a 16-bit backing buffer (they don't always).

A UTF-8 conversion would be applied on raising/lowering from/to (array immutable u8) for JS hosts, which could cache it for reuse of the same array/string.
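In JS terms the raising/lowering described above would have roughly these copying semantics; the optimization mentioned would replace the copies with views over the same backing buffer. This is a sketch of the semantics, not of Interface Types machinery:

```typescript
// Copying semantics of raise/lower between arrays and JS strings.
// An optimizing engine could alias the u16 backing store instead.
const dec = new TextDecoder();  // UTF-8 by default
const enc = new TextEncoder();  // always UTF-8

function raiseU16(a: Uint16Array): string {
  let s = "";
  for (let i = 0; i < a.length; i++) s += String.fromCharCode(a[i]);
  return s;
}
function lowerU16(s: string): Uint16Array {
  const a = new Uint16Array(s.length);
  for (let i = 0; i < s.length; i++) a[i] = s.charCodeAt(i);
  return a;
}
function raiseU8(a: Uint8Array): string { return dec.decode(a); }
function lowerU8(s: string): Uint8Array { return enc.encode(s); }
```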

I think this might give me what I would want for a JS subset if I weren't willing to directly use JS strings via function imports, and if engines optimize it it's potentially very cheap when using 16-bit strings.

However there may be other benefits to a more directly shared type that's string-specific.

@dcodeIO
Author

dcodeIO commented Oct 10, 2020

I agree that part of the challenges can, theoretically, be solved by interface types, but not all. For instance, we're still left with:

  • Avoid ecosystem fragmentation as would be introduced by separate mechanisms to use on and off the Web (when taking pre-imports into account)
  • Avoid alloc+copy->garbage at the boundary in between two Wasm GC-enabled languages and/or JavaScript (when copying remains necessary)
  • Avoid code size hits by having to handle strings explicitly (with adapter functions) at the boundary or shipping basic string functions and their dependencies with each module
  • Avoid hurting developer experience, like having to author, publish, ship and/or install a variety of adapters for/in different use cases
  • Avoid redundant re-encodings when forwarding strings through multiple modules expecting varying encodings (think npm)

@kripken
Member

kripken commented Oct 10, 2020

@dcodeIO Thanks for writing that up!

I wouldn't object to adding WTF-8 and WTF-16 types to wasm. But adding two types means you're not giving the ecosystem a clear signal, and likely you'd end up with two ecosystems of strings, which means copies between them.

Also, both of those types would not have optimal interop with strings on the CLR, JVM, or Python, for example (as your proposal says, they'd need to perform a check at the boundary - and also copy if there is an 8/16 mismatch), so I guess those are not in the 90% you expect the proposal to cover? But for many people those are very important platforms. I don't have numbers, but I'd strongly suspect the sum of those is far above 10% no matter how you look at it. (To really get something like 90%, I'd guess you need to also add UTF-8 and UTF-16 - but that just helps with some hosts, and still doesn't fully address the checks and copies between the 4 types.)

Overall I think that's a reasonable proposal, and I wouldn't have a problem with it. But the benefits seem moderate. I'd lean towards waiting on doing anything to see how the ecosystem evolves, which string types are used in practice, and how bad the overhead of copying etc. is - we can always consider adding these types later.

@dcodeIO
Author

dcodeIO commented Oct 10, 2020

But adding two types means you're not giving the ecosystem a clear signal, and likely you'd end up with two ecosystems of strings, which means copies between them.

I don't think that it is WebAssembly's business to give a clear signal (-> to bias, especially not against JS). We had that discussion earlier, and people agreed iirc. I do not agree with your assessment that we'll end up with two ecosystems, because definitely-not-a-proposal is designed to be exactly inclusive across languages, engines and the Web using either encoding, making the bulk of languages, engines and the Web work together seamlessly.

Also, both of those types would not have optimal interop with strings on the CLR, JVM, or Python, for example (as your proposal says, they'd need to perform a check at the boundary

The check at the boundary is much cheaper than alloc+copy->garbage, since the common case is that strings are well-formed. Still a net-win. This is also noted in the document, as are the reasons for picking WTF. Indeed we should make a list of languages and the encodings these (would) use (in WebAssembly with Universal Strings). For instance, I'm not so certain that all of the CLR, JVM and Python actually/strictly require well-formedness, or if one does, that this cannot be handled more efficiently otherwise within their WebAssembly target's string implementation.
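For concreteness, the boundary check in question is a single forward scan for unpaired surrogates (modern JS exposes the same test as String.prototype.isWellFormed). A sketch:

```typescript
// Well-formedness check for a WTF-16 string: scan for unpaired
// surrogates. O(N) in string length, but no allocation and no copy.
function isWellFormedUtf16(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff) {        // high surrogate...
      const d = s.charCodeAt(i + 1);         // NaN past the end
      if (!(d >= 0xdc00 && d <= 0xdfff)) return false;
      i++;                                    // ...paired: skip the low half
    } else if (c >= 0xdc00 && c <= 0xdfff) { // lone low surrogate
      return false;
    }
  }
  return true;
}
```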

I'd guess you need to also add UTF-8 and UTF-16

I am happy to discuss a PR adding them; trapping behavior still needs to be figured out. For instance, trapping behavior would immediately disqualify some languages, while picking WTF does not disqualify anyone. Previous discussions also indicated that a small set of initial encodings is preferred by some group members, but I (remember my words: I'm the good guy here) am happy to discuss and be convinced! Not gatekeeping my stuff at all.

I'd lean towards waiting on doing anything to see how the ecosystem evolves

Have been wondering if you'd make that argument actually. Interesting.

@kripken
Copy link
Member

kripken commented Oct 10, 2020

The check at the boundary is much cheaper than alloc+copy->garbage, since the common case is that strings are well-formed.

Very interesting - this may be a crucial point here! I disagree with that statement. Allocation and collection may be pretty cheap in a highly-tuned GC, O(1) (if allocation is a bump and it uses a generational GC where that item never makes it past the nursery). Whereas a check or a copy would be O(N) to traverse the string.

Maybe I'm making too much of an O(1) / O(N) difference here, as for a short string it'll all be in the cache. But imagine a loop that sends strings to the other side just to be compared to something. Optimally we'd want no O(N) operations on any of those - no copies or checks on the boundary, and the strings are interned so comparison is trivial. Languages compiling to JS have this today, I'm hoping wasm can too.

So in general, I'm worried about

  • allocation using malloc/free (close to O(1), but has fragmentation issues), but not GC
  • checking
  • copying

It may be possible to avoid the full check, for example storing a "valid" bit. But that may increase the object size and/or add VM complexity. I guess that's what you refer to here:

For instance, I'm not so certain that all of the CLR, JVM and Python actually/strictly require well-formedness, or if one does, that this cannot be handled more efficiently otherwise within their WebAssembly target's string implementation.

I don't know enough about those details, but I'm curious!

@dcodeIO
Copy link
Author

dcodeIO commented Oct 10, 2020

Allocation and collection may be pretty cheap in a highly-tuned GC, O(1)

What I can tell is that it isn't that easy in, for instance, TLSF, which is what I have the most experience with, but that's also more a predictable-performance MM than anything else. Other MMs typically optimize for small block sizes, and often can't guarantee the same fast operation for larger ones. But even then, if a string is small, the overhead of checking is small anyway. Regarding a GC, I guess Immix is very close to just bump allocations, but even there what you say creates fragmentation, and more fragmentation means that it has to obtain a new block from the global allocator more often, and will have to opportunistically defragment the block eventually, again at a cost. As such, oversimplifying to a comparison with an ideal bump allocator doesn't seem like a well-thought-through argument to me. More allocations, more copies and more garbage are pretty much always bad as far as I can tell. I can see if I can gather more intel on this, and if you or your coworkers can as well, that'd be great!

It may be possible to avoid the full check, for example storing a "valid" bit

Interesting idea! Might even be possible to set the bit upon construction or re-encoding of the string, making definitely-not-a-proposal even more awesome! 👍 (mind making a PR?)

@kripken
Copy link
Member

kripken commented Oct 10, 2020

As such, oversimplifying to a comparison with an ideal bump allocator doesn't seem like a well-thought-through argument to me.

There are a few more technical details, but allocation in the nursery really does become essentially a bump, and there is no fragmentation issue:

The Nursery, on the other hand, just grows until it is full. You never need to delete anything, at least until you free up the whole Nursery during a minor GC, so there is no need to track free regions. Consequently, the Nursery is perfect for bump allocation: to allocate N bytes you just check whether there is space available, then increment the current end-of-heap pointer by N bytes and return the previous pointer. https://hacks.mozilla.org/2014/09/generational-garbage-collection-in-firefox/

(of course those benefits only accrue to the nursery; heap allocation in general definitely has fragmentation issues, even in a GC language)
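A minimal sketch of the nursery bump allocation described in the quoted article (illustrative TypeScript, not an actual engine implementation): allocating N bytes is a bounds check plus a pointer increment.

```typescript
// Toy nursery: allocation is a bounds check and a pointer bump; the whole
// region is reclaimed wholesale when a minor GC evacuates the survivors.
class Nursery {
  private end = 0;
  constructor(private readonly capacity: number) {}

  // Returns the offset of the allocation, or null when the nursery is full
  // (which is the point at which a minor GC would be triggered).
  alloc(n: number): number | null {
    if (this.end + n > this.capacity) return null;
    const ptr = this.end;
    this.end += n; // the "bump"
    return ptr;
  }

  // After a minor GC has moved live objects out, the nursery resets in O(1).
  reset(): void {
    this.end = 0;
  }
}

const nursery = new Nursery(16);
console.log(nursery.alloc(8)); // 0
console.log(nursery.alloc(8)); // 8
console.log(nursery.alloc(1)); // null — nursery full, minor GC would run here
```

The O(1) claim in the quote applies to the `alloc` path only; the costs discussed later in this thread (write barriers, evacuating survivors) live in `reset` and around it.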

While searching I found this quote which I think summarizes it very well:

What's really important about generational GC is that heap allocation becomes nearly as cheap as stack allocation [i.e. O(1) - kripken]. That's a game changer. https://news.ycombinator.com/item?id=17887235 (apologies for a HN comment, but you can trust that author!)

@dcodeIO
Copy link
Author

dcodeIO commented Oct 11, 2020

Thanks, these are interesting pieces of information and I'll look into these in more detail. As a first response I can offer

Most objects will be allocated into a separate memory region called the Nursery. When the Nursery fills up, only the Nursery will be scanned for live objects.

from the first piece, so there's still some GC pressure there by filling up the nursery faster, triggering a scan of the nursery more often which has to promote objects to the tenured region more often and stuff like that. Not saying that this isn't great, because it looks impressive to me in general, but also not O(1), especially when taking into account that this is really only the ideal case. Curious how often that case triggers in an average program.

@dcodeIO
Copy link
Author

dcodeIO commented Oct 11, 2020

Gone through it now and made notes:

One (more generally) interesting aspect is that GGC assumes that copying is possible (it typically is if everything is JS with no external references), which might not be the case everywhere, especially off the web. I wonder whether my observation has an impact in practice, though, because copying may be hidden behind the reference types if native code must unwrap the string first, i.e. likely copy it, so no references can break. If it wouldn't unwrap and copy, but reference, then live objects in the nursery could not be moved (technically they can still be copied if the object is immutable, like strings; not sure if there are other implications) without breaking the reference. Thought I'd mention it, because there might be other implications here for GC algorithms off the web that might be interesting to talk about in other discussions.

we refer to Nursery collections as minor GC

and it also incurs some overhead during normal operation

Then, this summarizes my first impression above quite well: One reduces the amounts of major GCs to do, but still has the cost of minor GC including whatever a minor GC implies, like promoting objects or more work to do overall simply because the concept is there. Might be small, though, but it's not free.

Then there are also situations like

Consider the question of how to figure out whether some Nursery object is live. It might be pointed to by a live Tenured object — for example, if you create an object and store it into a property of a live Tenured object.

So we only care about the Tenured objects that have been modified since the last minor (or major) GC [...] In technical terms, this is known as a write barrier.

With a store buffer, the time for a minor GC is dependent on the number of newly-created edges from the Tenured area to the Nursery, not just the number of live objects in the Nursery.

indicating not-immediately-obvious overhead in minor GCs, which are more complex than plain bump allocation, leading to

Also, keeping track of the store buffer records (or even just the checks to see whether a store buffer record needs to be created) does slow down normal heap access a little, so some code patterns may actually run slower with GGC.

indicating that additional work is necessary to enable the concept, and that there are other implied drawbacks.

You are still right that

On the flip side, GGC can speed up object allocation.

but then the conclusion of the ideal-case (only) micro-benchmark still correctly (and I value that!) states

Note that this benchmark is intended to highlight the improvements possible with GGC. The actual benefit depends heavily on the details of a given script. In some scripts, the time to initialize an object is significant and may exceed the time required to allocate the memory. A higher percentage of Nursery objects may get tenured. When running inside the browser, we force enough major GCs (eg, after a redraw) that the benefits of GGC are less noticeable.

Hope this writeup, which is really just my impression from reading the pieces (and I only have good enough experience, I'm not an expert), is useful :)

@dcodeIO
Copy link
Author

dcodeIO commented Oct 11, 2020

On another note on WTF, I found this document on unicode.org, and under 2.7 Unicode Strings it says:

Depending on the programming environment, a Unicode string may or may not be required to be in the corresponding Unicode encoding form. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences. In normal processing, it can be far more efficient to allow such strings to contain code unit sequences that are not well-formed UTF-16—that is, isolated surrogates. Because strings are such a fundamental component of every program, checking for isolated surrogates in every operation that modifies strings can create significant overhead, especially because supplementary characters are extremely rare as a percentage of overall text in programs worldwide.

So it seems Java and C# don't strictly need well-formedness, just as JS doesn't. Perhaps it's even fair to conclude that this is the norm rather than the exception, so the WTF family of encodings makes perfect sense. The document also mentions other benefits I haven't brought up yet.
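To make the quoted point concrete, here is a small TypeScript sketch (runnable in Node.js) showing that a JS string containing an isolated surrogate is perfectly operable — no ordinary string operation checks well-formedness:

```typescript
// JS strings are WTF-16-style sequences of 16-bit units; an isolated surrogate
// is representable, and length/indexing/concatenation all work without checks.
const lone = "\uD800"; // high surrogate with no matching low surrogate

console.log(lone.length);                   // 1 — length works regardless
console.log(lone.charCodeAt(0) === 0xd800); // true — the unit is preserved as-is
console.log((lone + "a").length);           // 2 — concatenation performs no check
```

(On engines that support `String.prototype.isWellFormed`, `lone.isWellFormed()` returns `false`, which is exactly the boundary check this thread is weighing the cost of.)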

@Horcrux7
Copy link

Why should there be differently encoded strings? There can be different factory instructions for an abstract string type. The internal representation of the string should be the job of the host. Java also uses different encodings for strings internally.

A string.len should return the length without knowledge of the encoding used at creation.

@dcodeIO
Copy link
Author

dcodeIO commented Oct 12, 2020

The idea behind both 8 bit and 16 bit encodings (these are the two used in practice) is a use case like this: If we specify only an 8-bit encoding, and a module using a 16-bit encoding in its language passes a string to another module using a 16-bit encoding in its language, and now the second module wants to know the length of the string in its 16-bit encoding again for instance, then we end up in a situation where the first module has to (implicitly) re-encode, and the second module has to implicitly re-encode again. That's essentially the double re-encoding problem motivating WebAssembly/interface-types#13 years ago, and it also applies if no encoding is specced but the engine picks one.
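As an illustration of why the length question is encoding-dependent (a hypothetical example, not part of any proposal): the same string has three different "lengths" depending on whether you count 16-bit code units, UTF-8 bytes, or code points, so a module expecting one unit has to re-encode (or at least re-scan) a string delivered in the other.

```typescript
// One string, three lengths — the root of the double re-encoding problem.
const s = "héllo\u{1F600}"; // ASCII + Latin-1 + an astral (4-byte UTF-8) character

const utf16Units = s.length;                          // 16-bit code units
const utf8Bytes = new TextEncoder().encode(s).length; // UTF-8 bytes
const codePoints = [...s].length;                     // Unicode code points

console.log(utf16Units); // 7
console.log(utf8Bytes);  // 10
console.log(codePoints); // 6
```

A 16-bit-encoding language asking a hypothetical 8-bit string for its length would get 10 where it expects 7, forcing a conversion in each direction.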

@Horcrux7
Copy link

The JavaScript language also has different implementations of a string, and no re-encoding is needed to return the length. You are thinking too much in terms of a low-level programming language. A string is a high-level object; it can be a list of 8-bit and 16-bit chunks, and more. The length is only an attribute of it.

It is important that a string reference is immutable.

A string reference created from u8 and one created from u16 can be equal, so a ref.eq must work between both. A concat must also be possible.

If I need to know the encoding of the creating module, then every module must re-encode it to its own encoding, and then more than one string reference type exists. And last but not least: if I pass such a string to JavaScript, then the call to length() must also work without knowledge of the creating encoding.

@dcodeIO
Copy link
Author

dcodeIO commented Oct 12, 2020

Noticed that you are working on JWebAssembly, which I think will have to solve very much the same challenges AssemblyScript has. If you want, I can offer to meet up with you on, say, Discord so we can discuss this in more detail? Curious to see how your adventure is going, and to hear your ideas. (btw, seems we are both from .de)

@Horcrux7
Copy link

@skuzmich
Copy link

Agree with @dcodeIO and @Horcrux7 on a lot of points here. Wasm needs efficient immutable UTF-16-ish strings which are fully compatible with JS/DOM strings in both directions. It doesn't make sense to copy strings on the Wasm boundary for languages, like AssemblyScript, Java and Kotlin, which are fine with using the JS string encoding.

@jakobkummerow I'm wondering, is some form of this implementable in engines?

@jakobkummerow
Copy link
Contributor

@skuzmich : Sure, we obviously have support for JavaScript strings, and exposing those to Wasm would be easy. We don't have UTF-8/WTF-8 support yet, and adding that likely wouldn't be pleasant in terms of implementation complexity, but it's certainly doable. I think it mostly boils down to a spec question: what kinds of strings and string operations do we want to have in Wasm?

I agree that for JS<->Wasm interop in particular, having compatible strings would likely be very handy. I have no particular opinion on what Wasm strings should look like. Following the discussion here with interest. Given the complexities around strings, I'm inclined to think that while they feel related to the GC proposal, they probably deserve to be split out into their own proposal, just to let us focus on one thing at a time.

As an example: JavaScript, by spec, requires UCS2/WTF-16, so JS engines support that. It turns out that lots of memory is occupied by strings in many web apps, and many of those strings are one-byte strings, so V8 actually uses a one-byte ("latin1") encoding internally when possible. That makes everything much more complex, and requires copying conversions every now and then, but the memory savings are considered worth it. I see this as one data point illustrating how tough the tradeoff space is, and what kinds of ugly compromises Wasm strings may have to make in order to satisfy a number of competing goals.

@bvibber
Copy link

bvibber commented Oct 20, 2020

My understanding is that 1-byte strings in JS engines are a 1-byte subset of Unicode (e.g., ISO 8859-1) that maintains the core invariants:

  • length is fixed and known
  • reading a character code by index is possible in O(1)
  • the character codes are the same
  • the contents and length are immutable

Copying from a 1-byte to a 2-byte string would only be required if you were already copying it for some reason, such as a concatenation.

None of these conditions would be true of UTF-8/WTF-8 strings, which seem to be proposed as the alternative to UCS-2/WTF-16.
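The third invariant — identical character codes — is what lets an engine widen a 1-byte string lazily without remapping anything. A small TypeScript sketch of that property (illustrative; it inspects code points from the JS side, not engine internals):

```typescript
// Latin-1 ("one-byte") code units coincide with the first 256 Unicode code
// points, so the same numeric values are valid in both 1-byte and 2-byte form.
const latin1 = "caf\u00E9"; // every code point <= 0xFF, storable as one byte each

const codes = [...latin1].map((c) => c.codePointAt(0)!);
console.log(codes);                       // [ 99, 97, 102, 233 ]
console.log(codes.every((c) => c <= 0xff)); // true — eligible for 1-byte storage
```

By contrast, a UTF-8 representation of `"café"` would encode `é` as two bytes (0xC3 0xA9), breaking both the fixed-length and same-code-values invariants.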

@skuzmich
Copy link

@jakobkummerow thanks! People had concerns that the multiple internal representations of JS strings could prevent Wasm from efficiently using them. It is reassuring to hear that it is doable.

JS String operations that Kotlin's String class would care about are:

  • Construction from an array of 16-bit units.
  • Getting length.
  • Getting i-th 16-bit Char.
  • Checking content equality of two strings.
  • String concatenation.

The last two are easily implemented in terms of the others, but if they are optimized for some variants of the internal representation, it would be nice to have them in Wasm.
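For reference, all five operations listed above already exist on JS strings, which is what a JS-compatible Wasm string type would more or less expose directly (a TypeScript sketch over JS strings, not a proposed Wasm API):

```typescript
// The five operations a Kotlin-style String class needs, over JS strings.
const fromUnits = (units: number[]): string => String.fromCharCode(...units); // 1. construction from 16-bit units

const s = fromUnits([0x4b, 0x6f, 0x74]); // "Kot"
console.log(s.length);        // 3 — 2. length in 16-bit units
console.log(s.charCodeAt(1)); // 111 (0x6F) — 3. i-th 16-bit char
console.log(s === "Kot");     // true — 4. content equality
console.log(s + "lin");       // "Kotlin" — 5. concatenation
```

The point in the comment above is that while 4 and 5 are expressible via 1–3, engines optimize them per internal representation (e.g. comparing two interned or same-shape strings, or building a rope for concatenation), which is why dedicated instructions would help.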

@dcodeIO dcodeIO mentioned this issue Oct 29, 2020
@MaxGraey
Copy link

Btw this issue may relate to discussion about UTF16 vs UTF8 interop here:
WebAssembly/interface-types#38

@littledan
Copy link
Contributor

I think the interface types question about UTF-16 vs WTF-16 is quite separate from the question about whether, within a component, we want to allow passing strings across certain language boundaries. Here's my mental model of the situation, but I may be misunderstanding the point of components, so please correct me if this doesn't match others' intentions.

At big component boundaries, surrogate-checked UTF-16 makes sense to me, maybe with an opt-in for WTF-16 omitting surrogate checks. Coming from JS, this regularity will cost a linear-time scan, in addition to probably flattening (of the internal rope data structure) and maybe copying the string (depending on what it ends up linking to). I'd be fine with including an unchecked WTF-16 domstring type initially, but I have no strong opinions; I don't really see the rush.

Within a component, it will often be important to mix multiple programming languages, but it is even more important to avoid linear-time copies, checks, conversions between 8-bit and 16-bit string representations, and flattening of ropes. The importance is greater because there is more frequent expected back-and-forth within small modules of a component. This back-and-forth is amplified by how there can be reference circularities within a component, whereas there is a hierarchical relationship between components.

One example of such tight back-and-forth interaction, which @dcodeIO provides, is between AssemblyScript and JavaScript. In Igalia, we're interested in providing good cross-language support between C/C++ and JavaScript. In both cases, copying, flattening, converting between 8-bit and 16-bit, or doing a linear-time check over the string to detect missing surrogate pairs would add significant cost, due to the intention to enable lots and lots of strings crossing the boundary.

For the within-component case, the ideal solution would allow not just malformed surrogate sequences and 8-bit strings as V8 has, but also would allow JS Strings implemented as ropes (e.g., as the result of concatenation) to be passed to Wasm, possibly manipulated, and passed back, without verifying paired surrogates or even flattening the string.
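To illustrate the rope point (a toy sketch, not how any engine actually represents strings): concatenation can be O(1) by allocating a two-child node that records the combined length, while producing a flat character sequence is the O(N) step that the comment above hopes to avoid at the boundary.

```typescript
// Toy rope: a string is either a flat leaf or a concat node caching its length.
type Rope = string | { left: Rope; right: Rope; length: number };

const len = (r: Rope): number => r.length; // O(1) for both leaves and nodes

// O(1): no characters are touched, just a node allocation.
const concat = (a: Rope, b: Rope): Rope => ({ left: a, right: b, length: len(a) + len(b) });

// O(N): only needed when someone demands contiguous characters.
const flatten = (r: Rope): string =>
  typeof r === "string" ? r : flatten(r.left) + flatten(r.right);

const r = concat("Web", concat("Assem", "bly"));
console.log(len(r));     // 11 — answered without flattening
console.log(flatten(r)); // "WebAssembly"
```

Passing such a value to Wasm without flattening would require the Wasm-side string operations to understand (or at least tolerate) the node structure, which is exactly the engine-integration question raised here.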

To add a new type: my own personal aesthetics would be to start simpler, with a single built-in type and instructions to manipulate it, rather than the more complex pre-import strategy, but either option could be OK. (I'm OK with a built-in type being a little opinionated; I'm not sure if true neutrality is ever possible.)

I agree with others on this thread that a solution in this space is related to the Wasm GC proposal, but probably makes sense to pursue separately from this repository. In principle, such a separate proposal wouldn't actually depend on the GC proposal, since it could be refcounted, but I think we'd want to make it a subtype of externref (so, overall, it's very similar to typed function references). This new string type (or pre-import scheme) could provide a lot of value for both @dcodeIO 's and Igalia's use cases.

I suggest that we continue with the current MVP approach, to start out both Wasm GC and interface types without a solution for zero-copy, fully expressive string sharing across languages and components, while working in parallel on further proposals in this area such as the domstring interface type and a built-in Wasm string type (or pre-import scheme).

@tlively
Copy link
Member

tlively commented Apr 29, 2022

Now that we have https://github.com/WebAssembly/stringref, I'll go ahead and close this issue. Without that additional proposal, the options for strings include importing them as anyrefs or using byte arrays.

@tlively tlively closed this as completed Apr 29, 2022