
Changing the name and focus of this group from WebAssembly and binary code to WebCore etc and source code. #483

Closed
ghost opened this issue Nov 26, 2015 · 62 comments

@ghost commented Nov 26, 2015

I would like to make the case to the members for changing the focus of this group from the development of a binary code to a source code with a binary encoding. The difference might not sound significant at first, but it could make a real difference to the intent behind code deployed to the web in binary format.

In the current design, source code is 'compiled' or 'assembled' into the binary format and deployed in binary format. With this focus, the developers might be tempted to abandon any claim that the binary encoding is related to the source, and, for example, move to a linear virtual machine code without expressions or structured flow control.

While it might be possible to 'view-source' the deployed code, doing so might be considered 'disassembly' or 'reverse engineering', which are very loaded terms where intellectual property is concerned.

I believe that although the operators being developed are primitive and close to the hardware, they can still be used in a structured source code with expressions and local variables etc. to make the code more readable and easier to write. A binary encoding would still be developed, as a one-to-one reversible encoding of the source (basically a lossless compression of the source). I believe this could still be a good target for the compilation use case, which seems to be the current focus.

I have been working away at trying to use type derivation to help eliminate bounds checking, and there has been another recent proposal by sunfish to use some loop analysis to help eliminate bounds checks too. While I don't have anything concrete, I suspect this will be much easier to define in structured code. For example, a common case is to define a local constant variable whose type can be derived, such as by masking a value or asserting its bounds.
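
To illustrate the kind of derivation I mean, here is a minimal sketch in Python (the expression encoding and names are hypothetical, purely for illustration): deriving a value range from a masking operation is enough to prove an index in-bounds.

```python
# Toy range derivation (hypothetical encoding): if an index is computed
# as `x & mask` with a known non-negative constant mask, the result lies
# in [0, mask], so a load from a buffer of size mask+1 needs no check.

def derive_range(expr):
    """Return a conservative (lo, hi) range for a tuple-encoded expression,
    or (None, None) when nothing can be derived."""
    op = expr[0]
    if op == 'const':
        return (expr[1], expr[1])
    if op == 'and':
        lo, hi = derive_range(expr[2])
        if lo is not None and lo >= 0 and lo == hi:  # known constant mask
            return (0, hi)
    return (None, None)

def needs_bounds_check(index_expr, buffer_size):
    lo, hi = derive_range(index_expr)
    return lo is None or lo < 0 or hi >= buffer_size

# i = x & 0xFF is provably in [0, 255], so indexing a 256-entry
# buffer needs no bounds check:
print(needs_bounds_check(('and', ('local', 'x'), ('const', 0xFF)), 256))  # False
```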

The new name would remove 'Assembly' and make it clear that this is a source code, although a primitive one. For example, WebCore, if it is not taken. The specification language would change its emphasis to being a source code, while still supporting the use case of being a compilation target.

Would there be any support for such a re-focusing of the group, or do the majority of people want a web machine code binary format to compile to?

@ghost (author) commented Nov 26, 2015

Based on established precedent for this group, one 'lgtm' is all it takes to save the web from a binary machine code; by the same precedent, all objections are ignored. Please, someone on the Community Group, support this issue and save the web. If you are not already a member, you can join at https://www.w3.org/community/webassembly/

@ghost (author) commented Nov 27, 2015

The name WebCore seems to be well and truly taken. Another suggestion: WebBitScript. A quick US trademark search found nothing for bit-script, nor for web-bit-script, but web-script was popular.

Expanding on the source compression efficiency: I believe it could be competitive with whatever wasm could achieve. The key would be to have a canonical text source style that compresses most efficiently; any deviation might increase the compressed size in order to remain lossless. If the producer wanted maximum compression they would first canonicalize the style of their source text, and this could simply be a compression option that ignores non-canonical styling and text.

There could be a canonical white space convention for the source. The canonical text style need not be stripped of white space; rather, it could use standard indentation etc.

The same principle might be used for other source elements, such as labels, which might have a canonical minified format while the compression still degrades gracefully if some labels are not in canonical form. For example, the producer might choose not to canonicalize function names but to canonicalize block labels and local variables, or to keep some common local variable names, such as a stack pointer, which might compress very well given their repetition.

The canonical form might include specific support for comments, so that the consumer knows the difference between comments and non-canonical styling.

Multiple text source formats (JS-like, minified JS-like, s-exp, etc) might be supported, and if in canonical style then they could be converted to another canonical format without loss. Source with non-canonical text would obviously not convert without loss but could still convert to a canonical style.
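
As a rough sketch of how this graceful degradation might work (Python; the canonicalization rules and the idea of carrying the exact text alongside are hypothetical illustrations, not a format proposal):

```python
import re
import zlib

def canonicalize(src):
    """Rewrite source into a hypothetical canonical style: collapsed
    whitespace, and block labels renamed $L0, $L1, ... in order of use."""
    src = re.sub(r'\s+', ' ', src).strip()
    labels = {}
    def rename(m):
        labels.setdefault(m.group(0), '$L%d' % len(labels))
        return labels[m.group(0)]
    return re.sub(r'\$\w+', rename, src)

def encoded_size(src):
    """Canonical input pays no penalty; non-canonical input stays
    lossless but carries extra bytes to recover the exact text.
    (A real encoder would store only a delta, not the whole text.)"""
    canon = canonicalize(src)
    size = len(zlib.compress(canon.encode()))
    if canon != src:
        size += len(zlib.compress(src.encode()))
    return size
```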

Anyway, it seems quite possible to focus on a primitive 'source' code without compromising the goals of being a 'portable, size- and load-time-efficient format suitable for compilation to the web'.

The difference seems to just be what intent people want:

  • a binary virtual machine code, with a design and language that parallels compilation to native machine code, and might invoke the parallels of 'disassembly' and 'reverse engineering' which might disadvantage web user rights but advance some agendas.
  • versus a clear source code story for the web, which is my strong preference.

@kripken (member) commented Nov 27, 2015

Overall I am sympathetic to your general position, being more on the "high-level" side myself. But I think we are close to what you propose already. Specifically, when you say

I would like to make the case to the members to consider changing the focus of this group from the development of a binary code to a source code with a binary encoding.

I think we are already doing something equivalent to that, unless I misunderstand you: we are designing an AST-based language/semantics. It will have both a binary and a text format. And the text format is described in TextFormat.md as

  • "The text format is equivalent and isomorphic to the binary format."
  • "Given that the code representation is actually an Abstract Syntax Tree, the syntax would contain nested statements and expressions (instead of the linear list of instructions most assembly languages have)."

So we are not making just a binary code, rather we are designing semantics that can be efficiently encoded in binary, but also have a direct parallel in a text format; and the semantics are described in an AST, which will naturally be reflected in the text format (which is a benefit of being AST-based).

When you suggest a "source code with a binary encoding", do you mean something more than what those two quotes imply? Apologies if I missed your point.

While it might be possible to 'view-source' the deployed code, doing so might be considered 'disassembly' or 'reverse engineering', which are very loaded terms where intellectual property is concerned.

A fair concern, but I believe that is why the TextFormat.md document says, as quoted earlier,

  • "The text format is equivalent and isomorphic to the binary format."

In other words, no sophisticated process or algorithm will be needed to convert back and forth between the text and binary formats, they are just different representations of each other (again, this is a benefit of our being AST-based). Also, the binary is explicitly intended to be decoded to the text format (and vice versa), and browsers will do it for view-source (as also mentioned in TextFormat.md), so such decoding is a natural and expected process, unlike reverse engineering or disassembling.

The new name would remove 'Assembly' and make it clear this is a source code although primitive.

I think that the name "WebAssembly" has been accepted by the web community; I don't recall much concern about the name itself. So I wouldn't support changing it, given that it already has significant familiarity at this point. However, I think I agree with your underlying concern, in that "assembly" describes something very low-level, which might worry some people. But I think those people saw the documentation stating that we are AST-based, along with the clearly AST-based current s-expression format, and concluded that there isn't cause for worry. This of course once more stresses the importance of our being AST-based.

Overall I think the name "WebAssembly" is good in that it suggests both "Web" which is high-level, view-source friendly, open, and so forth, and "Assembly" which is low-level, performance friendly and compiler friendly. We do want all of those things, both "Web" and "Assembly".

In summary, as I said I am sympathetic to your concerns, but think that we are close to what you want. With one caveat, that we do not have a final text format yet, and clearly this high-level/low-level issue will play out there. I think efforts to ensure that WebAssembly ends up a good fit for the web are best focused on that. We've worked so far mostly on the core semantics, instead of either the binary format or the text format, but I do think most of us have been thinking more about the binary format while doing so. That trend worries me, and perhaps it is what concerns you as well? But, I think if we focus on making sure we get a good text format, everything will be fine. In concrete terms, I think it would be helpful to do more prototyping and discussing of text format options (I intend to do so in binaryen myself, when I can).

@rossberg (member) commented:

+1 to what Alon said. I tend to agree that "Assembly" in the name is a bit misleading technically speaking. But it is established by now. And it is a cute play on slogans like "JavaScript is the assembly language of the web".

@ghost (author) commented Nov 27, 2015

Thank you for following up. The tone of the discussions, and the decisions being made, do not communicate to me a strong commitment by a core of the group to the AST.

It was recently described as an 'implementation detail', and I got no support from this core of the group for clarifying the focus. Using the term 'AST' might well be confusing to someone just starting on wasm, particularly if they know some assembly code; it could have been worded as 'the development of a structured source code with expressions', which would have communicated this, yet there was no support for that either.

I just don't think there is enough respect for this being a source code, and that this should be clearly articulated as a technical constraint and a goal.

On 11/27/2015 02:58 PM, Alon Zakai wrote:

Overall I am sympathetic to your general position, being more on the
"high-level" side myself. But I think we are close to what you propose
already. Specifically, when you say

I would like to make the case to the members to consider changing
the focus of this group from the development of a binary code to a
source code with a binary encoding.

Yes, the technical difference is not too great (at present). Why not go all-in on this being a 'source code' rather than leaving any doubt?

I think we are already doing something equivalent to that, unless I
misunderstand you: we are designing an AST-based language/semantics. It
will have both a binary and a text format. And the text format is
described in TextFormat.md as

  • "The text format is equivalent and isomorphic to the binary format."
  • "Given that the code representation is actually an Abstract Syntax
    Tree, the syntax would contain nested statements and expressions
    (instead of the linear list of instructions most assembly languages
    have)."

It reads as an afterthought. It sounds like the 'binary format' is the driver, and could be a linear machine code, and the 'text format' would just follow. 'Isomorphic' is rather vague. Why not just go all-in and state that the binary format is a one-to-one encoding of the text source (comments and white space and labels and all), and make this a top-level constraint? Make it a constraint that a valid source text file can be encoded without loss. I think this would clear up a lot of trouble, and would add only minor complexity to the binary encoding; I would much rather be addressing some technical nits than messing with a high-level constraint or goal.

So we are not making just a binary code, rather we are designing
semantics that can be efficiently encoded in binary, but also have a
direct parallel in a text format; and the semantics are described in an
AST, which will naturally be reflected in the text format (which is a
benefit of being AST-based).

When you suggest a "source code with a binary encoding", do you mean
something more than what those two quotes imply? Apologies if I missed
your point.

As noted above, I don't see this as being recognized as a hard constraint by a core of the group, and I suggest going further to make it a high level constraint.

While it might be possible to 'view-source' the deployed code it
might be consider 'disassembly' or 'reverse engineering' which are
very loaded terms for IP.

A fair concern, but I believe that is why the TextFormat.md document
says, as quoted earlier,

  • "The text format is equivalent and isomorphic to the binary format."

In other words, no sophisticated process or algorithm will be needed to
convert back and forth between the text and binary formats, they are
just different representations of each other (again, this is a benefit
of our being AST-based). Also, the binary is explicitly intended to be
decoded to the text format (and vice versa), and browsers will do it for
view-source (as also mentioned in TextFormat.md), so such decoding is a
natural and expected process, unlike reverse engineering or disassembling.

This just does not appear to be articulated as a core constraint, and as noted above 'isomorphic' is vague, and it's a lossy encoding unless the source is in a rather restricted canonical form. A lossless encoding just makes things clear: clear for the developers now, clear for newcomers, and perhaps clear for the courts, if it could be seen as equivalent to compressed source code (although it's been parsed too).

The new name would remove 'Assembly' and make it clear this is a
source code although primitive.

I think that the name "WebAssembly" has been accepted by the web
community, I don't recall much concern on the name itself. So I wouldn't
support changing it, given that it already has significant familiarity
at this point. However, I think I agree with your underlying concern, in
that "assembly" describes something very low-level, which might worry
some people. But I think those people saw the documentation stating that
we are AST-based, and also saw the clearly AST-based current
s-expression format, and therefore they saw that there isn't cause for
worry. This of course once more stresses the importance of our being
AST-based.

Yes, the AST was the saving grace when I saw what was being done, yet it would seem to be wishful thinking on my part, or at least to warrant articulating as a high-level constraint, and I think this is best done by making it clear that a source code is being developed, with expressions. Part of this should be a recognition of the source as not just a compilation target, and some conveniences for writing and reading the code should be accommodated too.

Some of the arguments against my suggestion of adding block-local variables verged on the bizarre, such as suggestions that it would make interpreters less efficient (unable to allocate all locals on entry), yet the AST will require an interpreter to push and pop intermediate expression values anyway (unless there is an agenda to flatten the AST).

The use case of the language being a compilation target is given too much prominence. If this is the sole use case then there may be technical merit in stripping the language bare of any and all unnecessary support - the AST will go and it will clearly be a virtual machine code.

Overall I think the name "WebAssembly" is good in that it suggests both
"Web" which is high-level, view-source friendly, open, and so forth, and
"Assembly" which is low-level, performance friendly and compiler
friendly. We do want all of those things, both "Web" and "Assembly".

Assemblers produce machine code. Recovering the assembly code is disassembly. Converting it to structured code is (perhaps) reverse engineering. It's the wrong parallel. It didn't really matter for asm.js, as there was always the constraint of having the JS source, and the context helps too.

In summary, as I said I am sympathetic to your concerns, but think that
we are close to what you want. With one caveat, that we do not have a
final text format yet, and clearly this high-level/low-level issue will
play out there. I think efforts to ensure that WebAssembly ends up a
good fit for the web are best focused on that. We've worked so far
mostly on the core semantics, instead of either the binary format or the
text format, but I do think most of us have been thinking more about the
binary format while doing so. That trend worries me, and perhaps it is
what concerns you as well? But, I think if we focus on making sure we
get a good text format, everything will be fine. In concrete terms, I
think it would be helpful to do more prototyping and discussing of text
format options (I intend to do so in binaryen myself, when I can).

Sounds good, thank you.

@kripken (member) commented Nov 27, 2015

'Isomorphic' is rather vague. Why not just go all-in and state that the binary format is a one-to-one encoding of the text source (comments and white space and labels and all), and make this a top-level constraint? Make it a constraint that a valid source text file can be encoded without loss. I think this would clear up a lot of trouble.

It's true that whitespace and labels and so forth are not 100% preserved. However, I think that is a benefit for the text format and not a downside: it gives us more freedom to represent things. The binary format will be optimized for small download and fast parsing, and might use various encoding techniques which are not necessarily good for readability of a text format. The text format will still be translatable to and from it, but it might do some simple expansions/mergings to make it more readable. At least we have the option to do this, given the current text, while if it said "1-to-1 with every detail in the binary format" then we would not.

If you feel that the benefits of being AST-based would make sense to be mentioned more prominently, then I would support that. But the question is where. In that other issue, we were discussing the title page, and I suggested we just copy the original WebAssembly overview from the W3C community page - because it's been there since the beginning, no one had issue with it, and why not be consistent with that. Perhaps there is another location, and I would be open to hear where.

But I do feel that it is already mentioned prominently - it's in the very name of AstSemantics.md, and it's stated in TextFormat.md and in FAQ.md. In other words, I think the AST aspect is well-supported in the text. But I agree that source/text format concerns have not always been prominent so far. I think the way to fix that is to focus on them now with prototyping and discussion.

@ghost (author) commented Nov 27, 2015

On 11/28/2015 06:52 AM, Alon Zakai wrote:

'Isomorphic' is rather vague. Why not just go all-in and state that
the binary format is a one-to-one encoding of the text source
(comments and white-space and labels and all), and make this a top
level constraint. Make it a constraint that a valid source text file
can be encoded without loss. I think this would clear up a lot of
trouble.

It's true that whitespace and labels and so forth are not 100%
preserved. However, I think that is a benefit for the text format and
not a downside: it gives us more freedom to represent things. The binary
format will be optimized for small download and fast parsing, and might
use various encoding techniques which are not necessarily good for
readability of a text format. The text format will still be translatable
to and from it, but it might do some simple expansions/mergings to make
it more readable. At least we have the option to do this, given the
current text, while if it said "1-to-1 with every detail in the binary
format" then we would not.

I don't think the claimed 'benefit' is real, and I see downsides. Can we explore this further? Could you give an example of a substantive 'benefit' to consider?

Here's the downside: I believe there would be a huge benefit for the web in the encoded source being capable of encoding the text source one-to-one, including comments, any style convention variations, named labels, etc. This would mean that web users could view-source the encoded source and see the annotated code. Not supporting this seems a very significant downside.

The other downside of leaving this vague is that it leaves open an ongoing series of decisions that border on a high-level and visible area of conflict. Deciding now that the source is encoded one-to-one settles all of these and we can move on - the remaining challenges become technical matters of how this agreed constraint is met, and these can be assessed on technical merit; people are probably not so fussed about small differences in the technical solutions.

Obviously it still leaves open what is valid source code, which leads into the matter of having expressions (an AST), supporting comments, and labels etc.

@kripken (member) commented Nov 27, 2015

Sure, a concrete example is that the binary format might have only a branch and a numeric index to which location it should go, because this might be the most compressible encoding and the quickest to parse (I say "might", because data could show otherwise, of course).

And on the other hand, the text format might have a break or a continue plus a textual label, if we agree that that is overall the clearest for most people to read in view-source (I say "might" because there isn't agreement on this, #445).

The point is that the binary and text formats have somewhat different goals:

  • We need the binary format to be as small and fast as possible, because most people on the web will encounter WebAssembly only by running a site that has it, and we best serve them by making that site load as quickly as possible.
  • But, of course, on the other hand we have web developers that need view-source, and without them, there isn't any content for people to view; to best serve web developers, we need an easy to read text format.

And those two goals, of "small and fast" and "easy to read", won't always agree. Hence I think it is useful for us to have some amount of freedom in the binary <-> text relationship. Not too much, obviously, but hopefully enough to let each format be more optimal for its particular goals. Or to put it another way, enough leeway so that details in one do not overly constrain the other - that benefits both of the formats.

@ghost (author) commented Nov 27, 2015

On 11/28/2015 09:10 AM, Alon Zakai wrote:

Sure, a concrete example is that the binary format might have only a
branch and a numeric index to which location it should go, because
this might be the most compressible encoding and the quickest to parse
(I say "might", because data could show otherwise, of course).

And on the other hand, the text format might have a break or a
continue plus a textual label, if we agree that that is overall the
clearest for most people to read in view-source (I say "might" because
there isn't agreement on this, #445).

This use case can still be addressed when the encoded source is a one-to-one lossless encoding of the text source. When the text source also uses only a branch and a numeric index then it will compress just as well.

Obviously there will be some overhead in the encoding to handle falling back gracefully, but I believe it will be very small, and in the limit could be just one bit. For example:

The one-bit-overhead solution: when the text source is in the most-compressible canonical style, the C bit is set and the file is encoded as it would be under a wasm binary encoding; when the text source is not in that style, the C bit is clear and the encoded source includes two blobs, the first being the wasm binary encoding and the second the compressed source text.

The One-bit-overhead-per-function solution: as above, but per function or section.
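
A minimal sketch of that one-bit-overhead container (Python; the flag byte, length prefix, and use of zlib are hypothetical choices for illustration, not a concrete format proposal):

```python
import struct
import zlib

def encode(binary_blob, source_text, canonical_text):
    """C bit set: the text was canonical, ship only the binary encoding.
    C bit clear: ship the binary plus the compressed exact source text."""
    if source_text == canonical_text:
        return b'\x01' + binary_blob
    text = zlib.compress(source_text.encode())
    return b'\x00' + struct.pack('<I', len(binary_blob)) + binary_blob + text

def decode(container):
    flag, body = container[0], container[1:]
    if flag == 1:
        return body, None  # the text is regenerated from the binary
    n, = struct.unpack('<I', body[:4])
    return body[4:4 + n], zlib.decompress(body[4 + n:]).decode()
```

A real design would presumably store only the non-canonical differences rather than a whole second blob, but even this naive layout shows that the overhead for canonical input is a single byte.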

I am confident that in practice we could do much better, trading a small overhead for graceful degradation of the encoded file size as the text source varies from the most-compressible canonical style. For example, adding support for named labels and variables etc. might just add a few extra opcodes or optional sections.

So this example holds only in the extreme; given acceptance of some reasonable overhead to accommodate readable annotated source, it is not a valid example.

Can you think of another example that does not depend on an extreme optimization for the encoded file size at the expense of all else?

The point is that the binary and text formats have somewhat different goals:

  • We need the binary format to be as small and fast as possible,
    because most people on the web will encounter WebAssembly only by
    running a site that has it, and we best serve them by making that
    site load as quickly as possible.
  • But, of course, on the other hand we have web developers that need
    view-source, and without them, there isn't any content for people to
    view; to best serve web developers, we need an easy to read text format.

There seems to be a use case or goal to support a compressed encoding and fast parsing but I believe this can still be met while maintaining a one-to-one lossless source encoding.

Surely there is also a use case for supporting well-annotated source code. I hear people talking about a one-way 'assembly' or 'compilation', or about including 'debug' info, but this casts the deployed binary as an object file.

I have a strong preference for a lossless source encoding rather than using one-way 'assembly' or 'compilation' solutions to address this use case. It makes the intent clear. It also avoids the need for separate text source files to view the annotated source.

As another added benefit for the web, if the canonical style is pretty printed then for maximum compression source code would need to be deployed in the canonical pretty printed style and could be viewed in this pretty style. There would be a disincentive to minify or obfuscate code using style, namely larger encoded file sizes.

@svick commented Nov 27, 2015

I believe there would be a huge benefit for the web in the encoded source being capable of encoding the text source one-to-one, including comments and any style convention variations and named labels etc. This would mean that web users could view-source the encoded source and see the annotated code.

How would that be a huge benefit? In the vast majority of cases, there won't be any comments, style conventions or named labels, since the WebAssembly binary code was compiled directly from another language.

And to view the code in that other language (including comments etc.), you can use source maps.

@kripken (member) commented Nov 28, 2015

@JSStats:

When the text source also uses only a branch and a numeric index then it will compress just as well.

Sorry if I wasn't clear, I was trying to make the opposite point. Yes, we could make the text format just as compressible, as you state. But it would lose clarity by doing so, since I think higher-level control flow constructs would be preferable to the majority of developers on the web.

@ghost (author) commented Nov 28, 2015

On 11/28/2015 10:54 AM, Petr Onderka wrote:

I believe there would be a huge benefit for the web in the encoded
source being capable of encoding the text source one-to-one,
including comments and any style convention variations and named
labels etc. This would mean that web users could view-source the
encoded source and see the annotated code.

How would that be a huge benefit? In the vast majority of cases, there
won't be any comments, style conventions or named labels, since the
WebAssembly binary code was compiled directly from another language.

This use case, which sounds important to you, would not be adversely affected. Even for this use case it would help, because compilers can generate source code that includes annotations, which can be very helpful when profiling or debugging.

@kripken I worry that we are talking about different points. There is a difference between the canonical source text being 'just as compressible' and the text source being able to include annotations and labels etc. That is, the source text can support being just as compressible, and also support extra annotations etc. for extra clarity. Whatever text format you come up with, which I presume will compress well, could be the canonical format, and then you have all the clarity you wanted - I don't see a difference on this point. It could be a choice for the source text producer whether to canonicalize their source for maximum compression or to retain some degree of annotation, and the encoding might fall back gracefully in compression efficiency as the text source varies from the canonical style.

On the matter of control flow, I have not seen a big issue here yet; it seems quite possible to interpret the blocks/loop/br as high-level constructs with a little pattern matching, and where there are multiple interpretations (which could lead to some information loss) some extra opcodes might be needed to distinguish them. I am not proposing that the canonical source need be blocks/loop/br, or even that it should use numbers for relative labels, and certainly not that the text source language should expose all the binary encoding details.

@kripken (member) commented Nov 28, 2015

It isn't obvious to me how to create a text format that is both maximally compressible but also has the option for sufficient extra annotations and clarity. But perhaps that is just me not seeing a solution to this that you already do. Concrete proposals for a text format that can do both would be great, of course, that's exactly what we need now.

@ghost (author) commented Nov 28, 2015

@kripken

  1. Write a wasm-binary-to-pretty-source converter. See if the blocks can be converted to indented text source blocks and the AST expressions converted to source expressions, and if there are any show-stoppers then file issues to have the AST changed to better support this. You could emit a br to a block end as a break, a br to a loop head as a continue, and match a loop with a test at the end and emit a do-while, etc. (see the sketch after this list). Let this be the canonical text style that compresses well.
  2. Write a parser to reverse this transform. Now so long as the text is in the canonical style it will be encoded without loss. But it loses information when not in canonical style.
  3. Extend the encoding to support the lossless encoding of all valid source, with some accepted increase in the encoded size. For example add opcodes or sections for comment text, and add support to name variables, add support for named labels, etc. This will surely decrease the encoding efficiency even when these options are not used, but I don't think it will be significant. Now there is a clear source code story for the web: the text source and the pre-parsed encoded source.
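
To make step 1 concrete, here is a tiny sketch of that pattern matching (Python, over a made-up tuple AST; a real wasm AST is richer, so treat this as illustration only):

```python
def pretty(node, depth=0):
    """Pretty-print a toy AST: ('loop', ...body), ('block', ...body),
    ('br', levels), ('br_if', levels, cond), or a plain statement string."""
    pad = '  ' * depth
    if isinstance(node, str):
        return pad + node + ';\n'
    op = node[0]
    if op == 'loop' and node[-1][:2] == ('br_if', 0):
        # A loop ending in a conditional br to its own head is a do-while.
        body = ''.join(pretty(n, depth + 1) for n in node[1:-1])
        return pad + 'do {\n' + body + pad + '} while (%s);\n' % node[-1][2]
    if op == 'block':
        return (pad + '{\n'
                + ''.join(pretty(n, depth + 1) for n in node[1:])
                + pad + '}\n')
    if op == 'br':
        return pad + 'break;\n'  # a br to a block end reads as a break
    return pad + '/* unhandled: %r */\n' % (node,)

print(pretty(('loop', 'x = x + 1', ('br_if', 0, 'x < n'))), end='')
# do {
#   x = x + 1;
# } while (x < n);
```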

@kripken (member) commented Nov 28, 2015

I think 1 and 2 are exactly what we should be prototyping and experimenting with now. I have hopes that we can achieve them well, but also worries.

3 is an interesting proposal. Let me try to restate it, to see if I understand:

  1. As background, we intend to have source maps, but not in the MVP. Source maps will allow people to ship a binary + a source map for the language it was compiled from, and the browser will show the original source code in view-source, including comments and whitespace.
  2. Your proposal differs from source maps in that it supports comments, whitespace, and proper names for things, for wasm itself, not a language compiled into wasm like c++. But given the clear parallel, perhaps we can call this "source maps for wasm itself"? Or, "source maps for the wasm text format" (which will have comments, whitespace, and proper names), if we see the wasm text format as a language that we compile into the wasm binary format.
  3. Your proposal suggests that we support source maps for wasm itself as a core part of wasm, already in the MVP, and not as an addition later as we plan with source maps for other languages.

Do I understand you correctly?

@ghost (author) commented Nov 28, 2015

@kripken No, source maps cast the deployed file as a binary object file, with a clearly separate and optional source file. The view that 'we see the wasm text format as a language that we compile into the wasm binary format' is part of the problem - a very unfortunate outcome. The 'Assembly' naming implies this interpretation: an assembly pass converting the source to a binary machine code. This issue is about moving away from that view, towards a clear source code story for the web, and about having the specification and the definitions within it reflect this, and having the group's discussions and language reflect this, as a matter of professional good faith.

Source maps can still be supported for mapping back to other languages that have been source-to-source translated, but are a separate matter.

Obviously, to have a clear source code story for the web, the MVP needs to be source code too. The polyfill could ignore many of the annotations (such as comments) when the size of the emitted JS is an issue, or optionally map them back to JS if people want the code to be as readable as possible - there will be some loss converting between languages, but many elements could be translated, even comments. This is not something that can be feature-detected in future versions - it needs to be built in from the start. I am making the case for the source code story to be a core goal and a constraint on the technical outcome, not an afterthought.

@kripken (member) commented Nov 28, 2015

The view that 'we see the wasm text format as a language that we compile into the wasm binary format' is part of the problem - a very unfortunate outcome.

What I wrote was "if we see the wasm text format as a language that we compile into the wasm binary format." In other words, I tried to present it that way in hopes of clarifying things.

the MVP needs to be source code too

The MVP has both a binary and a text format. You don't consider that text format sufficient to count as a "source code", and I surmised that it was because it lacks whitespace and labels. I therefore suggested that a concrete way to implement what you suggest could be to add those things to the MVP text format. But it sounds like I have not properly understood you yet - what more would be required, concretely?

@ghost (author) commented Nov 28, 2015

@kripken Simply: the deployed blob should be a source code or lossless source code encoding, and consideration should be given to it being readable and writeable source code.

Adding comments and labels to the 'text format' would help address the second part, 'consideration should be given to it being readable and writeable source code', but if these are not encoded in the deployed blob then it casts the blob deployed to the web as a binary code and not a first-class source code, which I believe would be very unfortunate for the many reasons articulated above.

The MVP does not currently have a deployable text format, rather the model you noted of a text format that is compiled to a binary format with presumed loss of information.

The first step is for the group to make a decision that the deployed blob is a lossless source code encoding. Then the language in the Specification and the name of the group can be updated to communicate this clearly and the technical solution updated to meet this constraint.

@kripken (member) commented Nov 29, 2015

The MVP does not currently have a deployable text format, rather the model you noted of a text format that is compiled to a binary format with presumed loss of information.

I would put it this way: We have a semantics, and we will have a binary format and a text format. Both will represent those semantics, and both can be converted to the other. Both are important. They are currently not intended to be converted to each other without loss.

Lossless source encoding is an interesting proposal, which could perhaps be added to the current model. I can see it adding value, but also complexity, and also it has downsides as noted before. I don't yet have an opinion on it myself. I think we could debate it more at length with a concrete pull request with that addition to the design docs, because this issue here - of changing the name and focus - is far more broad and general.

@ghost (author) commented Nov 29, 2015

It seems to me that getting the semantics right is far more critical than the encoding of the AST.
At the risk of introducing a red herring, I would argue that it is important to avoid mistakes in the AST, such as encoding explicit class names as strings as is done in the JVM.

There are valid arguments for having a one-way encoding of the intentions of the programmer. A lot of effort is spent currently in obfuscating JavaScript.

@mkazlauskas commented:

If I understand this proposal correctly, I don't see any real-world benefit to having a lossless encoding. Developers will want to view-source and debug code in the original (highest-level) language. Another point is obfuscation: as @fmccabe mentioned, I believe it's a good thing to have as an option.

@ghost (author) commented Nov 30, 2015

@fmccabe There is certainly a lot of work in specifying the semantics, and it is very, very difficult getting agreement across vendors. Just stating that it will have an AST is very vague - it could even mean a linear byte code in the extreme - and if optimizing purely for the 'compilation target' use case or the 'fast interpreter' use case then dropping the AST may well be the logical and optimal solution. So what goal or principle holds the AST in place? It might be possible to argue some advantages for parsing to SSA, or for encoding efficiency. The principle I propose is that some consideration be given to the source code being readable and writeable.

@fmccabe @mkazlauskas Good luck trying to sell 'obfuscation' to the web community as a well-supported use case of this group. You should articulate this point in your appeal to the group to adopt a 'one-way' encoding - could you please develop this argument? I hope the web community will support me in rejecting it. I would note that if the deployed blob is a lossless source encoding then you can still obfuscate as you wish, but doing so in ways that are outside the canonical (readable and writeable) styles will not encode as efficiently. Conversely, adding annotations will also increase the encoded blob size. I would like the group and the web community to settle this point, and for the result to be articulated clearly. Let's settle it.

@mkazlauskas If the language being developed is to be a first-class source code then there will be developers writing and reading it and viewing the source code in text format, and if the deployed blob encodes this source code without loss then they will be well supported too. The use case of viewing the source of translated source code will obviously still be well supported as well.

@ghost (author) commented Nov 30, 2015

@JSStats About obfuscation.

In fact, I have no need to 'sell' obfuscation. Whether this group supports it or not, there are legitimate reasons why publishers want it, especially in a world where software patents are looking increasingly 'hard' to get. IMO, one of the key motivations for native apps vs web apps for publishers on mobile devices is exactly the ability to deploy an application without the world's hackers being able to pick it apart. (Of course, you can disassemble a compiled C++ program, but it is expensive.)

As far as the AST is concerned, it is a relatively simple (IMO) part of the overall enterprise. Definitely, having a standard text representation is very helpful for developers. Again, a structured AST, as opposed to an SSA format, simplifies some of the analysis in interpretation. However, there is a risk with it: the temptation to lift the language to something closer to what regular programmers might program in.

The problem with that is that it is impossible to meet the natural requirements of all programming languages. For example, my interest is in languages like Prolog, Haskell and ML (among others). I personally couldn't care less about C++ or Java (except professionally). Someone from the latter community would have a hard time designing a low-level language that can handle C++ and Haskell equally well. However, they can both be compiled very efficiently to bare metal.

@ghost (author) commented Nov 30, 2015

Moved from the public mailing list:

On 11/27/2015 06:41 PM, Jeff Lewis wrote:

" The difference seems to just be what intent people want:

  • a binary virtual machine code, with a design and language that parallels compilation to native machine code, and might invoke the parallels of 'disassembly' and 'reverse engineering' which might disadvantage web user rights but advance some agendas.
  • versus a clear source code story for the web, which is my strong preference."

Why are these mutually exclusive concepts? The idea, as I understood it, was to create the foundations for a bytecode pseudomachine based system that in the end would deliver a Java- or .Net-like foundation for a future web that got us out of the 'script' mindset and into more modern and sophisticated development (like isolated libraries that couldn't stomp on each other).

You may be right; I see this in discussions and decisions, and I believe this was the direction articulated to me over a year ago. Yet there is mention of a source format and an AST. It needs clarification, and it needs to be communicated clearly to group members.

'More modern and sophisticated development' seems very subjective, and there is nothing modern about a one-way compilation to a machine code. The one-way compilation solutions seem to have all been failures on the web; perhaps this is the problem, and one not to repeat!

Perhaps we are close to something more 'modern' - a pre-parsed and compressed source code.

If you take a look at how .Net's CLR works - it really accomplishes both of these goals admirably. You can compile dynamically - which gives you the immediacy of a scripted language - in fact, you can compile IN your code for truly dynamic real time coding, and yet you can reverse compile IL (the bytecode .Net uses) back into source at any time. If you include the symbols, you can debug and even reconstruct variable names.

There have been p-codes etc. dating a long way back, at least to the early BASIC interpreters, and even these supported comments. My understanding is that .NET does not encode comments or source code style and is a lossy encoding, so the deployed blob is not source code.

The idea of continuing forward with a web infrastructure that still squirts tons of source code directly to a browser to be interpreted AS code seems pointless.

This is certainly not the intention, and the pipeline will not require the text source format. A pre-parsed and compressed source code will be deployed, not significantly larger than a byte code blob would be assuming the text source has been stripped.

Making it a first class source code is to support writing and reading and viewing the source, which will require decoding, and to avoid parallels with machine code and disassembly and reverse engineering etc. A text source viewer could also be smart enough to decode the source one function at a time, and so decode incrementally.

I don't actually care if it's .Net - any similar technology like Java's Bytecode - or even a wholly new such implementation would be a huge improvement.

So the proposed lossless pre-parsed compressed source code could meet your use cases too, if it had similar efficiency?

@ghost (author) commented Nov 30, 2015

On 11/30/2015 12:21 PM, Frank McCabe wrote:

@JSStats https://github.com/JSStats About obfuscation.
...
The problem with that is that it is impossible to meet the natural
requirement of all programming languages. For example, my interest is in
languages like Prolog, Haskell and ML (and others). I could personally
care less about C++ or Java (except professionally). Someone from the
latter community would have a hard time designing a low-level language
that can equally well handle C++ and Haskell. However, they can both be
compiled very efficiently to bare metal.

That is an interesting point; can you substantiate it? What are the primitives you need exposed for these languages that are not supported by the AST, and not supported by an (at least isomorphic) text format? What constraints do your use cases place on the design?

@ghost (author) commented Nov 30, 2015

Both Prolog and Haskell pose challenges to the current design. In some ways they also have a common requirement: a non-standard evaluation resulting in a need for more control over the representation of evaluation.

In the case of Prolog, it has an evaluation 'stack' (quotes required) with two features not found in normal languages. The first is a non-local return: when a Prolog program ends, its return point is not necessarily near the top of the stack, yet that stack must still be preserved in order to support backtracking.

The second feature is backtracking. What that means is that there are two separate ways in which a program can return: successfully or unsuccessfully.

In general, a good Prolog implementation needs a lot more explicit control of its evaluation stack than languages like Java/C++ do.

Haskell is a different case again. Its implementation has a number of features that are very foreign to conventional languages. In the first case, arguments are not evaluated prior to entry to functions. The effect of this is that all data looks like code. It also means that the normal array mapping of an evaluation stack is not efficient for Haskell.

In the second, there can be multiple return points to a function call: one where the result is represented as a single value, and one or more where the return result is 'postponed' with the components of the return value spread across multiple registers.

Both Prolog and Haskell require much more control over the representation of the evaluation machinery than is proposed here. This is, of course, glossing over the absolute requirement for full tail recursion optimization (the proposed mechanisms scheduled for after MVP are not actually strong enough).

One additional remark: the kind of structures needed to support Haskell and Prolog are also very good for supporting so-called asynchronous programming. So, even JavaScript and C++ could benefit from these techniques.

@abustin commented Nov 30, 2015

I'm still trying to grasp the problem being solved here. It sounds like you're proposing that source code obfuscation should be disallowed. Inventing a new human-readable/writeable language sounds like unnecessary abstraction, limitation and complication for this project.

I'd suggest creating a new project for the proposed language that compiles to WASM and embeds the compressed "source code" into the generated elf. If the web thinks it's important to have non-obfuscated viewable source-code, then they will adopt the project.

@VargrSoft commented:

I'll respond to several comments here...

I think you’re wrong on several counts, but I think it’s also because you’re looking at it the wrong way.

First, let me clarify something – the main point I was trying to make is that any bytecode based system will be better than just raw interpretation of JavaScript. I suggested .Net solely because it exists – has 15 years of experience and support – and is open source. I also suggested looking at the Java bytecode – or even rolling a new one. The important point was ‘don’t discard the advantages of a dynamic compilation based system that compiles to a bytecode machine’.

Second, there isn’t a solution for JavaScript on .NET that can get close to the performance of modern JS engines in the browser space because no one has written one. There’s never been a need for it. The closest thing would be Silverlight except it was targeted as a Flash replacement, not a JavaScript replacement.

However, Microsoft has been pitching JS/HTML as an alternative to C#/XAML for application construction for almost four years now (much to the chagrin of C# developers) and they have gotten good performance out of it. A browser would NOT have to have JS and .Net VMs. It would need a .Net VM with the appropriate Roslyn front end to do real time compiling of JS. But more to the point – to me, the idea shouldn’t be simply slapping in a new JS engine to interpret JS in some other way or worse, translate it back to regular JS – what’s the point of that? The point should be to REPLACE JS with a more productive, safer, modern system that allows for a new JS as well as other languages that can be precompiled to allow smaller payloads with more object safety and interoperability while providing backwards compatibility through language support and dynamic compiling in the browser running that code in a sandbox to limit the damage it can do.

As for the overhead - .Net and Java bytecode runs on IoT devices – far, far smaller than even the smallest browser. Heck, even Silverlight – which implemented a subset of WPF was just 4MB as a plugin – this would be integrated as a W3C standard component – so it wouldn’t even be a download. You might want to take a look at this https://www.ghielectronics.com/technologies/netmf to get a sense of just how small .Net can be. There are similar implementations for Java bytecode. At the same time, they’re extensible… you can cache binary bytecode libraries as users hit them and they’re typically small as well – so to me, this is a false concern.

I'm not sure who asked the question about how .Net handles comments - because the move to this forum 'obfuscated' the original owner's name... .Net doesn't compile in comments directly, but there is a set of special comments that get compiled into the assembly that allow for integrated documentation at the module, class and member levels. It would be possible to reverse-compile them out if desired - or strip them if you want to obfuscate a compiled module.

Now, onto abustin's comment. You've actually nailed a core part of my concern: compressed source code isn't a good way to handle this. It binds the design to a specific language that's meant to preserve readability in a kind of hamfisted way. It also has the effect of constraining any other language you wish to use with the system. That was a key difference between Java and .Net. Java bytecode was tightly coupled to Java. When people wanted to use other languages in a mixed environment, they ended up translating to Java and then compiling that to bytecode - which emits really inefficient code.

With .Net, they took the opposite approach and designed a 'best case' virtual CPU that was language agnostic and then wrote the languages to fit in. That resulted in literally over fifty different languages from C# and C++ to Smalltalk (S#), Pascal.Net to APL of all things - all compiling to the same IL and all interoperable. Even better - you could decompile from one to the other - Compile in C# and decompile in Pascal.Net and keep working with it.

It seems to me that this is a much more flexible solution to the 'viewable source code' problem than trying to compress or minify source code, or even semi-tokenizing the source code into a human-readable stream.

Cheers,
Jeff Lewis

@ghost (author) commented Nov 30, 2015

On 11/30/2015 09:24 PM, JeffLewisWA wrote:

I'll respond to several comments here...

I think you’re wrong on several accounts – but I think it’s also because
you’re looking at it the wrong way.

First, let me clarify something – the main point I was trying to make is
that any bytecode based system will be better than just raw
interpretation of JavaScript. I suggested .Net solely because it exists
– has 15 years of experience and support – and is open source. I also
suggested looking at the Java bytecode – or even rolling a new one. The
important point was ‘don’t discard the advantages of a dynamic
compilation based system that compiles to a bytecode machine’.

This is not a valid point, because the proposal being developed is not 'interpretation of JavaScript', nor even interpretation of a primitive source code language. The blob being deployed is pre-parsed and compressed, and optimized for file size, loading, and compilation.

Second, there isn’t a solution for JavaScript on .NET that can get close
to the performance of modern JS engines in the browser space because no
one has written one. There’s never been a need for it. The closest
thing would be Silverlight except it was targeted as a Flash
replacement, not a JavaScript replacement.

Not sure I understand this point. Some support for making the language readable and writeable is orthogonal to the other differences to JS.

The solution being developed here is not expected to reach native performance in general, due to the overhead of the sandbox. For example, a native language implementation could verify the values stored into an object slot, know the value range when read, and use this to optimize further - this is not possible using only the linear memory. Also, security through reachability is not possible, etc.

However, Microsoft has been pitching JS/HTML as an alternative to
C#/XAML for application construction for almost four years now (much
to the chagrin of C# developers) and they have gotten good performance
out of it. A browser would NOT have to have JS and .Net VMs. It would
need a .Net VM with the appropriate Roslyn front end to do real time
compiling of JS. But more to the point – to me, the idea shouldn’t be
simply slapping in a new JS engine to interpret JS in some other way or
worse, translate it back to regular JS – what’s the point of that? The
point should be to REPLACE JS with a more productive, safer, modern
system that allows for a new JS as well as other languages that can be
precompiled to allow smaller payloads with more object safety and
interoperability while providing backwards compatibility through
language support and dynamic compiling in the browser running that code
in a sandbox to limit the damage it can do.

This is not the language being developed here. I don't even think being a base for a JS front end is a use case being considered.

As for the overhead - .Net and Java bytecode runs on IoT devices – far,
far smaller than even the smallest browser. Heck, even Silverlight –
which implemented a subset of WPF was just 4MB as a plugin – this would
be integrated as a W3C standard component – so it wouldn’t even be a
download. You might want to take a look at this
https://www.ghielectronics.com/technologies/netmf to get a sense of just
how small .Net can be. There are similar implementations for Java
bytecode. At the same time, they’re extensible… you can cache binary
bytecode libraries as users hit them and they’re typically small as well
– so to me, this is a false concern.

Java was initially targeting small embedded devices, and was re-marketed to salvage something; many people were quite disappointed by it at the time - it's nothing special. The language being developed here is also expected to support lightweight consumers, and probably much lighter ones, as there is no object support and no runtime library baggage.

I'm not sure who asked the question about how .Net handles comments -
because the move to this forum 'obfuscated' the original owner's name...

I noted that .Net does not encode comments - interesting to hear it might. I did email you in private when moving the discussion here, and the message did include my name. Sorry if this was not clear.

Now, onto abustin's comment. You've actually nailed a core part of my
concern: compressed source code isn't a good way to handle this. It
binds the design to a specific language that's meant to preserve
readability in a kind of hamfisted way. It also has the effect of
constraining any other language you wish to use with the system. That
was a key difference between Java and .Net. Java bytecode was tightly
coupled to Java. When people wanted to use other languages in a mixed
environment, they ended up translating to Java and then compiling that
to bytecode - which emits really inefficient code.

There is no difference between the proposed lossless pre-parsed compressed source code, and the current lossy pre-parsed binary encoded code, from the point of view of 'binds the design to a specific language' or 'constraining any other language you wish to use with the system'. It just allows the deployed code to be a readable and writeable source code.

With .Net, they took the opposite approach and designed a 'best case'
virtual CPU that was language agnostic and then wrote the languages to
fit in. That resulted in literally over fifty different languages from
C# and C++ to Smalltalk (S#), Pascal.Net to APL of all things - all
compiling to the same IL and all interoperable. Even better - you could
decompile from one to the other - Compile in C# and decompile in
Pascal.Net and keep working with it.

.Net (CLI) is a stack-based linear virtual machine code - far from the conveniences of a readable and writeable source code. I don't want web users having to 'decompile' to view the source code.

It seems to me that this is a much more flexible solution to the
'viewable source code' problem than trying to compress or minify source
code or even semi-tokenizing the source code into a human readable stream.

With .Net a byte code is deployed - it's not a source code.

I largely support the model being developed here - the linear memory, the primitive operations, the lack of object support, etc. - but I would like the deployed code blob to be a clear source code, and I really don't think it compromises performance or deployed code size, or substantively changes the AST.

Regards
Douglas Crosher

@JeffScherrer

@JSStats I'm not sure how Native Client has failed, other than Google maybe not having any new announcements on it anymore. I've seen negative feedback on it, but the sources and nature of that feedback are questionable.

If Microsoft has invested in this approach and has vetted it enough to make it the backbone for Windows 10 Universal Apps, it's possible they've understood how to overcome some of the shortcomings people are mentioning about Native Client.

But let's think about this seriously. Java and .NET VM/JIT technologies have existed for a really long time. Why are we seeing a shift to a more native approach? Is it possible that all of the smart people working on analogous technologies at Microsoft and Google have reached the same conclusion: that they will never achieve native performance without compiling to a native output? By not going with such an approach, do we automatically give up the ability to have native performance, or do we make it extremely difficult for ourselves to achieve it? The main thing that stands out to me in the high-level goals is the first item in the list mentioning "native speeds". Is this something everyone is serious about?

@kripken Can you point me in the direction that mentions these philosophies? I've read a lot of the documentation on here; what seems closest is the high-level goals. While the high-level goals document is very helpful, it seems to be more of a feature-level set of business requirements.

A list/matrix of the different approaches, as well as their pros and cons would be extremely helpful for everyone to fully understand what's being done and how to collaborate better. Again, if I'm just not seeing this, please point me in that direction.

@abustin

abustin commented Dec 1, 2015

The native code path was tried and seems to have failed on the web - namely Native Client

Calling NACL a failure is somewhat hyperbolic. It's a proven, secure solution for deploying apps written in C, etc. It's the lack of vendor adoption that's held it back and made it a perceived failure. WASM (IIRC) is the compromise between asm.js and NACL that all vendors are aligned around. I'd personally prefer mass adoption of NACL or something similar.

I also find the idea that a binary format will cause the end of the web overblown. Why must a text format prevail in order for the web to keep functioning? W3C literature seems to say use the best encoding for the situation: http://www.w3.org/People/Bos/DesignGuide/binary-or-text.html.

I deal with many low end devices that spend a lot of time in the "parser" phase of plain text code. I look to WASM to remove that unnecessary parser step.

@kripken
Member

kripken commented Dec 1, 2015

@JeffScherrer see also the FAQ, which mentions PNaCl and LLVM IR, as alternative approaches that were considered. HighLevelGoals mentions portability as a core concern, which rules out the NaCl approach of building to each architecture (PNaCl also dropped that part of NaCl). There is also Rationale, but it's more specific.

I agree it might make sense to write up a comparison against .NET and the JVM, those specifically don't seem to be written up. Briefly,

  • JS perf on such VMs is not comparable to native JS VMs, despite plenty of research in the area. (The best is Truffle/Graal on the JVM, but even there, startup is a major concern.)
  • Concerns with running C++ at full speed on such a target (e.g., no unsigned on Java, indirect calls working differently, less low-level access in various ways, etc. - I don't know of a solid answer to this).
  • GC concerns with differences between JS and other VM's GCs. To avoid those, the idea is that in WebAssembly, we run inside a JS VM. We've shown already such VMs can run compiled C++ at native speed minus the costs of portability and sandboxing.

@ghost
Author

ghost commented Dec 1, 2015

@JeffScherrer @abustin Yes, NACL might have failed in part because it could not get multi-vendor support, sorry. But why the resistance from other vendors? Might it have been that they have different security models and implementations and thus need something a little more abstract? That they want to keep their options open for future low level changes and improvements that would be difficult with NACL? A key advantage I see in the asm.js/wasm approach is that it adds an extra step giving a little more flexibility here.

@abustin The proposal is still to deploy a pre-parsed compressed binary blob that is quick to parse and compile, just that it also be a lossless source code encoding. This issue is a little more subtle than text versus binary. I am not claiming a binary deployment of a stripped code 'will cause the end of the web', rather I just don't think it is the best outcome and it seems entirely unnecessary technically. I believe the onus should be on those who insist on stripping the deployment language to justify why this is necessary and to justify this extreme position?

@kg
Contributor

kg commented Dec 1, 2015

@JSStats

... The proposal is still to deploy a pre-parsed compressed binary blob that is quick to parse and compile, just that it also be a lossless source code encoding. This issue is a little more subtle than text versus binary. I am not claiming a binary deployment of a stripped code 'will cause the end of the web', rather I just don't think it is the best outcome and it seems entirely unnecessary technically. I believe the onus should be on those who insist on stripping the deployment language to justify why this is necessary and to justify this extreme position?

This is a sensible proposal with obvious advantages, but the implied costs are significant. Extensive research into compression ratio/decode speed characteristics has occurred already (see https://github.com/WebAssembly/polyfill-prototype-1 and https://github.com/WebAssembly/js-astcompressor-prototype) - and this is just the public wasm-focused work. We're leveraging an existing body of research into binary encodings for native instruction sets, trees, and other forms of data. We're also leaning heavily on knowledge about modern stream compressors and about how web applications are shipped and run on existing devices. Applying all this knowledge to text parsers and textual languages is just hard.

People have achieved incredible gains in size & decode speed for existing text languages, but the fact is that it is difficult to make text parsing as fast as a binary decoder. Even a poorly-designed, lazily implemented binary decoder can outperform a clever text decoder in terms of speed and heap usage - both of which are important considerations for the web, especially on resource-constrained platforms like mobile.
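
To make that contrast concrete, here is a minimal sketch of the kind of work a binary decoder does - an illustration only, not code from the wasm prototypes. It decodes an unsigned LEB128 varint (the variable-length integer encoding used in the binary format work): a few bit operations per byte, with no character scanning and no intermediate string allocation.

```python
def decode_uleb128(buf, pos=0):
    """Decode an unsigned LEB128 varint from bytes; return (value, new_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # low 7 bits carry payload
        if not (byte & 0x80):              # high bit clear marks the last byte
            return result, pos
        shift += 7

# 624485 encodes as the classic three-byte sequence 0xE5 0x8E 0x26.
value, end = decode_uleb128(bytes([0xE5, 0x8E, 0x26]))
assert value == 624485 and end == 3
```

A text parser recovering the same integer has to scan for a delimiter, validate digits, and build the value from base-10 characters - more work and more allocation per token, which is the point being made here.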

The existing body of research does not support the idea that it would be trivial to ship an efficient representation for decoding/compilation that also retains all the source-level information such that you can round-trip things like comments & variable names. One could build that format, but you wouldn't deploy it any more than people deploy large blobs of hand-written JS in production. It is certainly the case that some executable formats (e.g. Java class files or .NET assemblies) retain more information than others, but I'm not aware of any deployed real-world example of your proposal.

A file format cannot truly be all things to all people. At present, wasm is focused on solving some specific problems, and adding additional major objectives like round-tripping of arbitrary source risks compromising the most important objectives.

@mkazlauskas

I agree it might make sense to write up a comparison against .NET and the JVM, those specifically don't seem to be written up.

It might be worth adding Dart VM to this comparison. It's pretty much designed for the web and already has solutions to many problems (e.g. snapshots).

@ghost
Author

ghost commented Dec 1, 2015

On 12/01/2015 07:23 PM, Katelyn Gadd wrote:

@JSStats https://github.com/JSStats

... The proposal is still to deploy a pre-parsed compressed binary
blob that is quick to parse and compile, just that it also be a
lossless source code encoding. This issue is a little more subtle
than text versus binary. I am not claiming a binary deployment of a
stripped code 'will cause the end of the web', rather I just don't
think it is the best outcome and it seems entirely unnecessary
technically. I believe the onus should be on those who insist on
stripping the deployment language to justify why this is necessary
and to justify this extreme position?

This is a sensible proposal with obvious advantages, but the implied costs are significant. Extensive research into compression ratio/decode speed characteristics has occurred already (see https://github.com/WebAssembly/polyfill-prototype-1 and https://github.com/WebAssembly/js-astcompressor-prototype) - and this is just the public wasm-focused work. We're leveraging an existing body of research into binary encodings for native instruction sets, trees, and other forms of data. We're also leaning heavily on knowledge about modern stream compressors and about how web applications are shipped and run on existing devices. Applying all this knowledge to text parsers and textual languages is just hard.

I really am interested in the work you have done in this area and the binary encoding to be developed, but I seem to be failing to communicate some key point here, because I am not proposing to deploy a text source code, so the points you make against this are just not relevant.

The existing body of research does not support the idea that it would be
trivial to ship an efficient representation for decoding/compilation
that /also/ retains all the source-level information such that you can
round-trip things like comments & variable names. One could build that
format, but you wouldn't deploy it any more than people deploy large
blobs of hand-written JS in production. It is certainly the case that
some executable formats (i.e. java class files or .net assemblies)
retain more information than others, but I'm not aware of any deployed
real-world example of your proposal.

It seems obvious to me that this could have a low burden. I am sorry if I have been unable to communicate this. Perhaps we can revisit it as the binary format is developed and I can demonstrate how it can support annotations and labels with a small cost to the encoding.

Has anyone in the group been able to follow this technical point and thus be able to support it?

For example, that adding optional function header comments, and adding opcodes for statement level comments and line end comments, and supporting named labels, would not be a significant burden to the encoding and would not significantly increase the blob size if not used.
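
As a rough illustration of this claim (a toy sketch with made-up opcode numbers - the actual design assigns nothing of the sort): a comment could cost one opcode byte plus a length-prefixed string, so code that carries no annotations pays nothing at all.

```python
# Toy instruction encoder with an optional, hypothetical comment opcode.
OP_I32_CONST, OP_I32_ADD, OP_COMMENT = 0x10, 0x20, 0xF0  # invented values

def encode(nodes):
    out = bytearray()
    for node in nodes:
        if node[0] == 'comment':                    # optional annotation
            text = node[1].encode('utf-8')
            out += bytes([OP_COMMENT, len(text)]) + text
        elif node[0] == 'i32.const':
            out += bytes([OP_I32_CONST, node[1]])   # toy one-byte immediate
        else:                                       # 'i32.add'
            out.append(OP_I32_ADD)
    return bytes(out)

annotated = [('comment', 'the answer'), ('i32.const', 20),
             ('i32.const', 22), ('i32.add',)]
stripped = [n for n in annotated if n[0] != 'comment']

# The annotation costs exactly its own bytes: opcode + length + text.
assert len(encode(annotated)) - len(encode(stripped)) == 2 + len('the answer')
```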

@kg
Contributor

kg commented Dec 1, 2015

@JSStats I think I may have been confused by the emphasis on textual formats and other details. The specific thing you call out - optional opcodes for metadata like comments - is definitely possible and would fit naturally into the wasm extensibility model. There are some details that would need to be addressed, like how to ensure that the metadata opcodes do not break decoding in implementations that don't understand them. That might be as simple as using the polyfilling mechanisms to define them as a no-op. Encoding metadata as opcodes is something that would work in both textual and binary representations of the format, as it's an AST consideration.

The above is still distinct from taking textual sexpr comments and round-tripping them through a binary encoding, however. I was under the impression that you are interested in round-tripping the textual representation with complete fidelity, including whitespace, label/symbol names, and comments.

@lih

lih commented Dec 1, 2015

It may be sophistry on my part, but a textual format is also always a binary format, so the problem still remains the same : what binary format should we choose ? For that reason, I believe the name WebAssembly to be better suited to describe the true nature of this endeavour than WebScript or WebSource.

I like text, but what's the point of having a binary format if the browser still has to parse, optimize and JIT-compile the code ? The binary format should allow the parsing and optimizing phases to be moved out of the client, leaving only the JIT-compiling to be done before running the code (i.e. make the client do as little as possible).

Adding optional source annotations could be a plus for people who like to use their browser as a debugger, but let's realize that the overwhelming majority of clients just won't use that feature (when's the last time your mother clicked "view source" when a problem came up on a site?). Plus, the obfuscators out there could just obfuscate their code before compiling it anyway, so a "clean source code story" would be dependent on everyone's good will, which is not really an improvement over the current system.

If, despite all that, you still really want a canonical source format that can be translated to different viewing conventions, while adapting to every language feature out there, then what you want is a Lisp. It was designed to be as syntax-agnostic as can be, while still being directly mappable to any semantic thanks to its macros. All you need is a list of frontend parsers for all supported languages and you're good to go.

Personally, I'd rather have a browsable binary format that's portable and read-efficient, even if it isn't editable like a textual representation. Text is too sensitive to small details like whitespace conventions, syntactic sugar and encoding issues to provide a solid base for a universal standard, IMHO.

PS: I am in the process of writing a compiler that uses such a format for its object files, so it is definitely not impossible to design :-D

@ghost
Author

ghost commented Dec 1, 2015

@kg Great, perhaps I see a little light in the response. I think it can be extended to give the lossless encoding, but at this point it's too hard making the case so I'll defer this until it can be demonstrated. Bring on the binary encoding, and @kripken's text format.

@lih I think the key is a lossless translation. Compilation and assembly are generally lossy. This issue is only addressing the deployed code language - developers are still free to strip their code but I just don't want this to be the only option.

@lih

lih commented Dec 1, 2015

@JSStats in that case, let's make source annotations optional at compile-time (like gcc -g), so that developers can debug their code when testing it on their browser, then compile a release version without the extra symbols weighing it down.

I agree that the source should be accessible at all times (free software FTW), but since most clients will never need it, it should be kept separate from the binary for performance. In cases where you need source information, we could require that the stripped binary offer a link to the annotated version of the same program or a source map, so that the browser could seamlessly switch to "debug mode" if asked.
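
One possible shape for such a link, sketched under invented assumptions (the section id and layout here are illustrative only, not part of any design): a stripped blob carries a small, length-prefixed record pointing at the annotated build, and a decoder that does not understand the record skips it by its declared length.

```python
import struct

LINK_SECTION_ID = 0xFF  # hypothetical id, purely for illustration

def make_link_section(url):
    """id byte + little-endian payload length + UTF-8 payload."""
    payload = url.encode('utf-8')
    return bytes([LINK_SECTION_ID]) + struct.pack('<I', len(payload)) + payload

def read_link_section(blob):
    if blob and blob[0] == LINK_SECTION_ID:
        (n,) = struct.unpack_from('<I', blob, 1)
        return blob[5:5 + n].decode('utf-8')
    return None  # consumers that don't know the id can skip 5 + n bytes

blob = make_link_section('https://example.com/app.debug.blob')
assert read_link_section(blob) == 'https://example.com/app.debug.blob'
```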

@ghost
Author

ghost commented Dec 1, 2015

@lih This issue only addresses the deployment language, and I don't want to see an 'assembly' or 'compilation' from a deployment text format to the deployment binary blob, and separating it into a binary and source would make the binary a lossy version of the source. Developers might as well just have two deployment blobs, one with annotations and one stripped if they use both to keep things simple. This issue is not about the use case in which the deployment language is a compilation target, and in that case there are planned to be source maps and the source will be separate from the deployment language - but this would be a separate matter for discussion in another issue.

@lih

lih commented Dec 1, 2015

@JSStats There's bound to be some compilation phase on the client side to obtain a native binary, and I don't really see how compiling text is less objectionable than compiling a pre-parsed binary. Reading back, you seem to equate "binary" with "imperative and unstructured", which could explain our misunderstanding. I'm not suggesting a binary format containing some hard-to-analyze sequence of pseudo-instructions, but rather a structured binary format that describes an annotated optimized AST of the source, along with other metadata.

If I understand correctly, you'd like developers to be able to write in the deployment format directly by hand, so it has to be text. I like the immediacy of scripting in the native language of my environment, but there's no reason that the server couldn't recompile the source every time it's modified before sending the resulting blob when clients hit "refresh". That way, the immediacy would be intact, and everyone could use the textual format they prefer.

Whatever the format though, if you allow optimizations to be performed on the server-side at compile-time, then even though you could theoretically turn an optimized program back into textual form, it would be a garbled version of the original, and mostly useless to humans. Essentially, bidirectional translation between a blob and a readable source disallows any sort of "macro expansion" between the original source and the deployment format. That's why I suggested that blobs should provide a link to their original source instead. That way, you can optimize the program and still understand it if necessary.

PS: you're right, the binary would contain less information than the source, but since it would also contain a link to said source, then no information would be lost, and the compilation process wouldn't be "lossy".

@ghost
Author

ghost commented Dec 1, 2015

On 12/02/2015 04:59 AM, Marc Coiffier wrote:

@JSStats https://github.com/JSStats There's bound to be /some/
compilation phase on the client side to obtain a native binary, and I
don't really see how compiling text is less objectionable than compiling
a pre-parsed binary. Reading back, you seem to equate "binary" with
"imperative and unstructured", which could explain our misunderstanding.
I'm not suggesting a binary format containing some hard-to-analyze
sequence of pseudo-instructions, but rather a structured binary format
that describes an annotated optimized AST of the source, along with
other metadata.

The 'compilation phase' is down stream of the deployment language. This issue is only about the deployment language. We seem to both agree that the deployed code will be 'an annotated optimized AST of the source, along with other metadata.'

There is still the matter of what principle holds the AST in place, particularly if it were found to be non-optimal for some technical challenge. Just stating that there is an AST is very vague, and in the limit this could mean a linear machine code. The principle I propose is that some consideration be given to it being readable and writeable. Do you have some suggestions?

If I understand correctly, you'd like developers to be able to write in
the deployment format directly by hand, so it has to be text. I like the
immediacy of scripting in the native language of my environment, but
there's no reason that the server couldn't recompile the source every
time it's modified before sending the resulting blob when clients hit
"refresh". That way, the immediacy would be intact, and everyone could
use the textual format they prefer.

No, I suggest a text source code format and an encoded source code format that can be translated without loss. The encoded source code would be deployed. I expect machine producers will encode directly to the encoded format when efficiency is important. I expect developers to use a text editor to read and write the source code which would be translated to and from the encoded source code blob without loss.
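
A minimal sketch of the kind of lossless round trip being argued for here - a toy token-record encoding under my own assumptions, ignoring the real pre-parsing and compression a production design would layer on top. Every token, including comments and whitespace, is preserved, so decoding reproduces the text exactly; a producer that strips annotations first simply emits fewer records.

```python
import re

# Exhaustive tokenizer: line comments, whitespace runs, parens, atoms.
# Concatenating the matches reconstructs the input exactly.
TOKEN = re.compile(r';;[^\n]*|\s+|[()]|[^\s()]+')

def encode(src):
    out = bytearray()
    for tok in TOKEN.findall(src):
        data = tok.encode('utf-8')
        out += len(data).to_bytes(2, 'little') + data  # length-prefixed record
    return bytes(out)

def decode(blob):
    parts, pos = [], 0
    while pos < len(blob):
        n = int.from_bytes(blob[pos:pos + 2], 'little')
        parts.append(blob[pos + 2:pos + 2 + n].decode('utf-8'))
        pos += 2 + n
    return ''.join(parts)

src = "(func $add (param i32 i32) ;; annotated\n  (i32.add (get_local 0) (get_local 1)))"
assert decode(encode(src)) == src  # lossless: comments and whitespace intact
```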

There is no need for the deployed blob to be a lossy stripped version of the text source format. Do you have a reasonable reason why this must be the case?

Whatever the format though, if you allow optimizations to be performed
on the server-side at compile-time, then even though you could
theoretically turn an optimized program back into textual form, it would
be a garbled version of the original, and mostly useless to humans.

This is not the model proposed. It is true that a goal of wasm is to offload some of the optimization to the producer but this is reflected in the primitive operators and computation model. I don't want to see a deployment text source code that is 'server-side optimized' into the deployed blob - any of these optimizations can be upstream operations that can target a deployed language.

Essentially, bidirectional translation between a blob and a /readable/
source disallows any sort of "macro expansion" between the original
source and the deployment format. That's why I suggested that blobs
should provide a link to their original source instead. That way, you
can optimize the program and still understand it if necessary.

No, this issue is only about the deployment language. It has nothing to say about macro expansion between some upstream language and the deployment language. I worry this point has not been received.

PS: you're right, the binary would contain less information than the
source, but since it would also contain a link to said source, then no
information would be lost, and the compilation process wouldn't be "lossy".

This casts the deployed blob as not being a first class source code. It fails to degrade gracefully in size with the amount of annotation. I don't believe this is technically necessary. What reason is there to enforce this model on the web?

@sunfishcode sunfishcode added this to the Discussion milestone Dec 2, 2015
@jbondc
Contributor

jbondc commented Dec 3, 2015

Some thoughts:

  • The name WebAssembly sounds fine to me.
  • Love that there's a text syntax out there 😄
    https://github.com/WebAssembly/build-suite/blob/master/emscripten/hello_world/src.cpp.o.wast
  • Share some concerns about the binary format too. It's an optimization and caching problem. That's something that will be difficult to agree on. Consider that a new machine we never thought could exist comes out: is the binary format still efficient?
  • I expect people to write obfuscators on the text assembly anyways so don't think a standard binary version provides much value at protecting source code.

@ghost
Author

ghost commented Dec 3, 2015

On 12/04/2015 06:27 AM, Jon wrote:

Some thoughts:
Love that there's a text syntax out there 😄
https://github.com/WebAssembly/build-suite/blob/master/emscripten/hello_world/src.cpp.o.wast

While this text source code is intended to be 'isomorphic' to the deployed encoded code, it does not preserve comments, named labels, or white space. It is a workflow model in which a text Assembly source is assembled to a binary blob for deployment.

Wouldn't it be nice to be able to deploy the source code, with annotations if you chose to, rather than only being able to deploy a stripped binary?

Wouldn't it be nice to be able to view the source code as deployed without first clarifying whether 'disassembly' or 'reverse engineering' back to the deployed text source code is legally allowed? There are no flags on the binary to indicate permitted uses, and no support for comments to include a license - you would have to contact the author via a separate channel.

Wouldn't it be nice if, as a distributor of software that can view the source in text format, you had a little less to worry about on the legal side, because the encoded source was a craft-less lossless encoding of the text source code with no parallel to 'disassembly' or 'reverse engineering'?

For education purposes, or productivity, wouldn't it be nice to be able to distribute a single runnable encoded source file, and to allow students to view the text source code with explanatory annotations, without separately distributing a text source file and explaining the workflow needed to generate runnable code?

I expect people to write |obfuscators| on the text assembly anyways
so don't think a standard |binary| version provides much value at
protecting source code.

This issue is not about stopping developers stripping their code before deployment, and it has nothing to say about upstream text source code generation; rather it is about not making this the only option.

I see a small disincentive to using minification to obfuscate, because the AST does not really care about white space - that's about it.

@lih

lih commented Dec 4, 2015

@JSStats

The 'compilation phase' is down stream of the deployment language. This issue is only about the deployment language. We seem to both agree that the deployed code will be 'an annotated optimized AST of the source, along with other metadata.'

Technically, there are two phases to compilation: building and linking. Seeing as the deployment language is what allows the two to be separate, I'd say compilation is part of the problem.

There is still the matter of what principle holds the AST in place, particularly if it were found to be non-optimal for some technical challenge.

If a technical challenge arises that can't be well expressed as an AST, what would its source look like (since it's also an AST) ?

Just stating that there is an AST is very vague and in the limit this could mean a linear machine code.

It could, but so can textual source code represent an unstructured series of instructions. There are degenerate cases in all tree-like representations, that doesn't mean they are vague.

The principle I propose is that some consideration be given to it being readable and writeable? Do you have some suggestions?

I do. Making a binary format readable is a simple matter of finding the appropriate visual representation. Making it writable requires writing a "syntax editor" to be used instead of text editors (something like Epsilon, an old proof-of-concept of mine).

No, I suggest a text source code format and an encoded source code format that can be translated without loss. The encoded source code would be deployed. I expect machine producers will encode directly to the encoded format when efficiency is important. I expect developers to use a text editor to read and write the source code which would be translated to and from the encoded source code blob without loss.

Why do you expect developers to translate from the encoded source blob when they already have the source? It seems much more likely that they will indeed read and write the source, but only write to the encoded blob.

There is no need for the deployed blob to be a lossy stripped version of the text source format. Do you have a reasonable reason why this must be the case?
...
This casts the deployed blob as not being a first class source code. It fails to degrade gracefully in size with the amount of annotation. I don't believe this is technically necessary. What reason is there to enforce this model on the web?

Simplicity and efficiency, if nothing else. Since the vast majority of clients are not going to read the source, shipping it all the time would be a huge waste of space. Additionally, we can design a binary format that is easy to parse in order to reduce the clients' workload when they need to run it (offloading some of the strain on the server instead). Baking in a link to the source (as I suggested) in every binary wouldn't waste so much space and would enable every good thing about the source to be at hand when it is needed.

I guess what I'm saying is: if you send something to a client, don't presume that they need everything you can offer. Most of the time, they just need to run the app, so just send the app.

This is not the model proposed. It is true that a goal of wasm is to offload some of the optimization to the producer but this is reflected in the primitive operators and computation model. I don't want to see a deployment text source code that is 'server-side optimized' into the deployed blob - any of these optimizations can be upstream operations that can target a deployed language.
...
No, this issue is only about the deployment language. It has nothing to say about macro expansion between some upstream language and the deployment language. I worry this point has not been received.

What do you mean ? If your optimizations target a deployed language, in what sense are they not "server-side" ?

And if there is macro expansion (and/or optimization) between the upstream language and the deployment language, then how can you rebuild the original (pre-macro-expansion) upstream source from the deployed blob ?

Wouldn't it be nice to be able to deploy the source code, with annotations if you chose to, rather than only being able to deploy a stripped binary?

Who said anything about only allowing stripped binary deployments ? Quite the contrary, we seem to be in favor of choice : binary for the most frequent cases, and source deployments for the few curious souls. Source-only deployment is what we already have with JavaScript, why make the same mistake ?

Wouldn't it be nice if as a distributor of software that can view the source in text format that you had a little less to worry about on the legal side because the encoded source was a craft-less lossless encoding of the text source code without a parallel to 'disassembly' or 'reverse engineering'?

Disassembly and reverse engineering are not a problem, nor should they be thought of as a crime. They are merely ways to understand how a program works. What people do with that information, that is where the crimes are, and giving away the source won't stop them from misusing that information if they truly want to.

For education purposes, or productivity, wouldn't it be nice to be able to distribute a single runnable encode source file and to allow students to view the text source code with explanatory annotations without separately distributing a text source file and explaining the work flow needed to generate runnable code?

Seems more like a matter of tooling than standards. Give students the source (with explanatory annotations), give them a tool that compiles and runs it, and they won't need a workflow at first. When they are more experienced and start to write multiple modules, teach them how separate compilation works and teach them the full workflow. By that time, they won't mind the small extra complexity.

@jbondc

Love that there's a text syntax out there 😄
https://github.com/WebAssembly/build-suite/blob/master/emscripten/hello_world/src.cpp.o.wast

^^ indeed, that is a great syntax. I wonder where it comes from...

Share some concerns about the binary format too. It's an optimization and caching problem. That's something that will be difficult to agree on. Consider that a new machine we never thought could exist comes out, is the binary format still efficient?

Good question. If the binary format is equivalent to the source, then such a machine would make the source inefficient as well, which would prompt some rewrites, I think.

I expect people to write obfuscators on the text assembly anyways so don't think a standard binary version provides much value at protecting source code.

It doesn't, because nothing can. If I want to garble my source before sending it to the compiler, or the browser or whatever, then pretty much nobody can stop me (including file formats).

@ghost
Copy link
Author

ghost commented Dec 4, 2015

On 12/04/2015 10:39 PM, Marc Coiffier wrote:

@JSStats https://github.com/JSStats

The 'compilation phase' is down stream of the deployment language.
This issue is only about the deployment language. We seem to both
agree that the deployed code will be 'an annotated optimized AST of
the source, along with other metadata.'

Technically, there are two phases to compilation : building and linking.
Seeing as the deployment language is what allows the both to be
separate, I'd say compilation is part of the problem.

Again, this issue has nothing to say about upstream processing. The 'compilation' you refer to is an upstream source-to-deployment-language transform. The consumer compiles the code in the deployment language, or interprets it. This issue is in support of people hand-writing code in the deployment language, or wanting to include annotations in a translated source, etc.

The single-minded view that the only relevant use case is an upstream 'compilation' to a deployed binary is part of the problem. This can still be well supported, but this issue is about supporting another use case in which the deployed code is readable and writeable source code.

There is still the matter of what principle holds the AST in place,
particularly if it were found to be non-optimal for some technical
challenge.

If a technical challenge arises that can't be well expressed as an AST,
what would its source look like (since it's also an AST) ?

Sure, for example an interpreter might be better off with a linear byte code, and would this justify flattening the AST?

Just stating that there is an AST is very vague and in the limit
this could mean a linear machine code.

It could, but so can textual source code represent an unstructured
series of instructions. There are degenerate cases in all tree-like
representations, that doesn't mean they are vague.

Yes, and I don't want the text source code to be a linear Assembly code either.

The principle I propose is that some consideration be given to it
being readable and writeable? Do you have some suggestions?

I do. Making a binary format readable is a simple matter of finding the
appropriate visual representation. Making it writable requires writing a
"syntax editor" to be used instead of text editors (something like
Epsilon http://marc.coiffier.net/projects/epsilon.html, an old
proof-of-concept of mine).

Text editors are easy and portable. Visual programming can be interesting too, but do many people edit JS or C++ or any Assembly code in a visual structure editor?

I see an opportunity to support a range of editors with annotations. There could be function header comments, statement level comments, line end comments, etc. that could all map to a visual editor too. It might just be white space differences that would not map between a visual editor and a text editor, and these could probably be safely ignored.

No, I suggest a text source code format and an encoded source code
format that can be translated without loss. The encoded source code
would be deployed. I expect machine producers will encode directly
to the encoded format when efficiency is important. I expect
developers to use a text editor to read and write the source code
which would be translated to and from the encoded source code blob
without loss.

Why do you expect developpers to translate /from/ the encoded source
blob when they already have the source ? It seems much more likely that
they will indeed read and write the source, but only write to the
encoded blob.

Because developer A will deploy code to the web in encoded format and developer B might want to study it without wanting to separately obtain the text source code.

There is no need for the deployed blob to be a lossy stripped
version of the text source format. Do you have a reasonable reason
why this must be the case?
...
This casts the deployed blob as not being a first class source code.
It fails to degrade gracefully in size with the amount of
annotation. I don't believe this is technically necessary. What
reason is there to enforce this model on the web?

Simplicity and efficiency, if nothing else. Since the vast majority of
clients are not going to read the source, shipping it all the time would
be a huge waste of space. Additionally, we can design a binary format
that is easy to parse in order to reduce the clients' workload when they
need to run it (offloading some of the strain on the server instead).
Baking in a link to the source (as I suggested) in every binary wouldn't
waste so much space and would enable every good thing about the source
to be at hand when it is needed.

While this is true, it is only true when taking an extreme position on 'simplicity and efficiency'. Adding a few opcodes is not going to move the dial on simplicity or efficiency.

The majority of clients will be web browsers, which I do hope will be capable of reading the source, as in having a view-source option.

If a developer wants to include annotations that increase the file size then that should be a matter for the developer. They will know best the target audience.

The proposal does not compromise the goal of being 'easy to parse in order to reduce the clients' workload'.

You are welcome to propose a separate link to the text source code in another issue - it does not address this issue. You can have your use case, let users have a first class source code too.

I guess what I'm saying is: if you send something to a client, don't presume that they need everything you can offer. Most of the time, they just need to run the app, so just send the app.

Making the deployed blob a lossless source encoding does not force anyone to deploy non-stripped code - it can be stripped before encoding, and the deployed blob is still a lossless encoding.

In contrast, the current design would force everyone on the web to strip their deployment code - it presumes that no one will want to deploy annotated source code.

This is not the model proposed. It is true that a goal of wasm is to
offload some of the optimization to the producer but this is
reflected in the primitive operators and computation model. I don't
want to see a deployment text source code that is 'server-side
optimized' into the deployed blob - any of these optimizations can
be upstream operations that can target a deployed language.
...
No, this issue is only about the deployment language. It has nothing
to say about macro expansion between some upstream language and the
deployment language. I worry this point has not been received.

What do you mean ? If your optimizations target a deployed language, in
what sense are they not "server-side" ?

There should be no 'server side' optimizations between the deployment text format and the deployment binary, apart from encoding matters. If there were some then they could be moved upstream of the text source code.

And if there is macro expansion (and/or optimization) between the
upstream language and the deployment language, then how can you rebuild
the original (pre-macro-expansion) upstream source from the deployed blob ?

This issue has nothing to say about upstream production. These are all upstream matters. There is talk of source maps to address these issues.

Wouldn't it be nice to be able to deploy the source code, with
annotations if you chose to, rather than only being able to deploy a
stripped binary?

Who said anything about /only/ allowing stripped binary deployments ?
Quite the contrary, we seem to be in favor of choice : binary for the
most frequent cases, and source deployments for the few curious souls.
Source-only deployment is what we already have with JavaScript, why make
the same mistake ?

The current design is one of 'stripped binary deployment'!

There is no talk of text source deployments. This issue is not even about support for that, it is about making the deployed blob a lossless encoding of the text source code.

The proposal is not analogous to text source JS deployment.

Wouldn't it be nice if as a distributor of software that can view
the source in text format that you had a little less to worry about
on the legal side because the encoded source was a craft-less
lossless encoding of the text source code without a parallel to
'disassembly' or 'reverse engineering'?

Disassembly and reverse engineering are not a problem, nor should they
be thought of as a crime. They are merely ways to understand how a
program works. What people do with that information, that is where the
crimes are, and giving away the source won't stop them from misusing
that information if they truly want to.

Check a commercial license. There are typically restrictions on disassembly and reverse engineering.

For education purposes, or productivity, wouldn't it be nice to be
able to distribute a single runnable encode source file and to allow
students to view the text source code with explanatory annotations
without separately distributing a text source file and explaining
the work flow needed to generate runnable code?

Seems more like a matter of tooling than standards. Give students the
source (with explanatory annotations), give them a tool that compiles
/and/ runs it, and they won't need a workflow at first. When they are
more experienced and start to write multiple modules, teach them how
separate compilation works and teach them the full workflow. By that
time, they won't mind the small extra complexity.

This does not justify why everyone should be forced to use this model, or why everyone should be forced to deploy stripped binaries.

You have made some claims about 'simplicity and efficiency' and I dispute them, unless an extreme position is taken on these. If I can demonstrate that the file size overhead is less than 1%, then would you change your view?

@jbondc
Contributor

jbondc commented Dec 4, 2015

@lih Side-note, interesting about Epsilon:

What a third dimension will bring in terms of expressiveness I do not know, but it has the potential to be ground-breaking.

👍 Been thinking about this with languages that can express concurrency in 3D.

@lih

lih commented Dec 4, 2015

@jbondc
You see, I didn't think about that, but it's ground breaking ;-) Graphs in 3D don't need to be planar, so dataflow representations can really shine there. What sort of language did you have in mind ?

@JSStats
From your response, I don't think I quite understand what your proposal entails yet. I think I might get it with an example. If I write in Haskell (as I often do) and produce a blob from my program, are you proposing that I be able to retrieve only the Haskell source losslessly from the blob, or also that I be able to generate, say, a Java code equivalent, with comments and annotations preserved from the Haskell source ? Or that I be able to produce another canonical representation of my code, regardless of the source language ?

If I should be able to retrieve only the original source, how does a source map not solve this issue ?

If you want to design a representation that can losslessly translate between any two languages, I'm afraid that can't be done.

If you want a common canonical representation, the Lisp-like representation given by jbondc fits that bill just fine, with a few extensions allowing for comments to be included.

@ghost
Author

ghost commented Dec 4, 2015

@lih I do worry that many people do not understand the issue. Let's say you write in Haskell and are translating to native wasm deployment code (not interpreting Haskell); you end up with a deployment blob. This deployment blob has a text format representation too, one that is already defined to be isomorphic to the binary blob and is not the upstream Haskell source. This issue is only about the deployment blob being a lossless encoding of its text format; it has nothing to say about the upstream translation from Haskell to the deployment language. The proposal would give you the option of having the upstream translation include annotations in the deployment blob, rather than them always being stripped, and it would support a graceful increase in file size with the amount of annotation. If you choose to strip the annotations, then it would be equivalent to the existing design. The proposal would allow the deployment source code to be viewed with the optional annotations - it has nothing to say about viewing the upstream Haskell source, but source maps might address that and be, as you describe, an optional link to the upstream source.

A source map between the deployment blob and the deployment text source code casts the deployment blob as not being a first class source code, parallels disassembly and reverse engineering, does not degrade gracefully as the amount of annotation increases, and has its own usability and complexity downsides.

I don't think a lisp text format would be popular, even though it would be familiar to me, and it is not necessary for the proposal here and thus out of scope for this issue. As an aside, encoding the AST in wasm does not seem to me to change the landscape wrt visual programming - it is already possible to parse text source code to create an AST for this purpose.

It is relatively trivial to show that we can losslessly translate between the encoded binary and its isomorphic text format when it is stripped of all annotations.

It's very hard communicating the proposed solution, so I propose revisiting this issue when I can demonstrate it, which should not be too long after we have a binary and text format to play with; or it might be possible to demonstrate it with the current polyfill-prototype-1.

@kg
Contributor

kg commented Dec 4, 2015

@JSStats

This issue is only about the deployment blob being a lossless encoding of its text format

The text and binary formats are both lossless encodings of the AST, so I don't see the problem here.

@ghost
Author

ghost commented Dec 4, 2015

On 12/05/2015 08:56 AM, Katelyn Gadd wrote:

@JSStats https://github.com/JSStats

This issue is only about the deployment blob being a lossless
encoding of its text format

The text and binary formats are both lossless encodings of the AST, so I
don't see the problem here.

Given the confusion I detect in responses, would you and the chairs be prepared to state clearly to the group that they could have an encoded source code that is a lossless one-to-one encoding of the text source code, without compromising the encoded size and parsing efficiency?

The AST is not currently a lossless encoding of the text source code, and no consideration has been given to supporting annotations. The current design enforces the stripping of the annotations. This may well suit your uses and agenda and this will still be supported.

Please articulate clearly why we should all be forced to endure a stripped deployment blob just because you 'don't see the problem here'.

@kripken
Member

kripken commented Dec 5, 2015

@kg I think we've said we are not aiming to guarantee that. Textformat.md says

The text format isn't uniquely representable. Multiple textual files can assemble to the same binary file, for example whitespace isn't relevant and memory initialization can be broken out into smaller pieces in the text format.

Other examples might be that someone might write code in the text format that has meaningful names for locals, labels, etc., while the binary format would have just indices for those things.
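
One way to reconcile those two directions, sketched here under my own assumptions rather than anything in the design docs: the binary body refers to locals by index, and an optional, discardable side table maps indices back to the names used in the text format. Stripped blobs simply omit the table and render with bare indices.

```python
def render(body, names=None):
    """Render a toy instruction list as text, using local names when present."""
    names = names or {}
    out = []
    for instr in body:
        if instr[0] == 'get_local':
            idx = instr[1]
            label = names.get(idx)
            out.append(f"(get_local ${label})" if label else f"(get_local {idx})")
        else:
            out.append(f"({instr[0]})")
    return ' '.join(out)

body = [('get_local', 0), ('get_local', 1), ('i32.add',)]
print(render(body))                        # (get_local 0) (get_local 1) (i32.add)
print(render(body, {0: 'lhs', 1: 'rhs'}))  # (get_local $lhs) (get_local $rhs) (i32.add)
```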

@kripken
Member

kripken commented Dec 5, 2015

@kg oops, I might have misread your statement as talking about the other direction of conversion.

@ghost
Author

ghost commented Jan 6, 2016

Shall explore augmenting the AST to make it a little more readable and writeable, and as good a source code as seems practical, using an optional source code metadata section, and shall do so elsewhere to avoid conflict. This group can ponder the merits later, if it wishes, when re-visiting the formats.

@ghost ghost closed this as completed Jan 6, 2016