Changing the name and focus of this group from WebAssembly and binary code to WebCore etc and source code. #483
Comments
Based on established precedent in this group, one 'lgtm' is all it takes to save the web from a binary machine code - all objections are ignored, also based on precedent. Please, someone on the Community Group, support this issue and save the web. If you are not already a member you can join at https://www.w3.org/community/webassembly/ |
The name WebCore seems well taken. Another suggestion: WebBitScript. A quick US trademark search found nothing for bit-script, nor web-bit-script, but web-script was popular.

Expanding on the source compression efficiency, I believe it could be competitive with whatever wasm could achieve. The key would be to have a canonical text source style that compresses most efficiently; any deviation might increase the compressed size in order to remain lossless. If the producer wanted maximum compression they would first canonicalize the style of their source text, and this could just be a compression option to ignore non-canonical styling and text. There could be a canonical white-space source convention. The canonical text style need not be stripped of white space; rather it could use standard indentation etc. The same principle might be used for some other source elements, such as labels, which might have a canonical minified format while the compression could still degrade gracefully if some labels were not in canonical form. For example, the producer might choose not to canonicalize function names but to canonicalize block labels and local variables, or might choose to keep some common local variable names, such as a stack pointer, which might compress very well given their repetition. The canonical form might include specific support for comments, so that the consumer knows the difference between comments and non-canonical styling. Multiple text source formats (JS-like, minified JS-like, s-exp, etc.) might be supported, and if in canonical style they could be converted to another canonical format without loss. Source with non-canonical text would obviously not convert without loss, but could still convert to a canonical style.

Anyway, it seems quite possible to focus on a primitive 'source' code without compromising the goals of being a 'portable, size- and load-time-efficient format suitable for compilation to the web'. The difference seems to come down to what people intend.
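As a rough, runnable illustration of what 'canonical style' could mean operationally - the toy language, the helper names and the single-space canonical form are all assumptions for illustration, not part of any proposal:

```typescript
// Toy illustration only: the "AST" is just a token list and the canonical
// style is single-space separation. A real design would parse the actual
// language and pretty-print it with standard indentation, keeping comments.
type Ast = string[];

function parse(source: string): Ast {
  return source.trim().split(/\s+/);
}

function printCanonical(ast: Ast): string {
  return ast.join(" ");
}

// A source text is "in canonical style" exactly when pretty-printing its own
// AST reproduces it byte for byte; such text can be encoded as the bare AST
// and still round-trip losslessly. Anything else needs extra information in
// the encoding (style deltas, comments) to stay lossless, at some size cost.
function isCanonical(source: string): boolean {
  return printCanonical(parse(source)) === source;
}

console.log(isCanonical("(block (br 0))"));       // true  - canonical spacing
console.log(isCanonical("(block\n  (br 0)\n)"));  // false - styled, costs extra bytes
```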
|
Overall I am sympathetic to your general position, being more on the "high-level" side myself. But I think we are close to what you propose already. Specifically, when you say
I think we are already doing something equivalent to that, unless I misunderstand you: we are designing an AST-based language/semantics. It will have both a binary and a text format. And the text format is described in TextFormat.md as
So we are not making just a binary code, rather we are designing semantics that can be efficiently encoded in binary, but also have a direct parallel in a text format; and the semantics are described in an AST, which will naturally be reflected in the text format (which is a benefit of being AST-based). When you suggest a "source code with a binary encoding", do you mean something more than what those two quotes imply? Apologies if I missed your point.
A fair concern, but I believe that is why the TextFormat.md document says, as quoted earlier,
In other words, no sophisticated process or algorithm will be needed to convert back and forth between the text and binary formats, they are just different representations of each other (again, this is a benefit of our being AST-based). Also, the binary is explicitly intended to be decoded to the text format (and vice versa), and browsers will do it for view-source (as also mentioned in TextFormat.md), so such decoding is a natural and expected process, unlike reverse engineering or disassembling.
I think that the name "WebAssembly" has been accepted by the web community, I don't recall much concern on the name itself. So I wouldn't support changing it, given that it already has significant familiarity at this point. However, I think I agree with your underlying concern, in that "assembly" describes something very low-level, which might worry some people. But I think those people saw the documentation stating that we are AST-based, and also saw the clearly AST-based current s-expression format, and therefore they saw that there isn't cause for worry. This of course once more stresses the importance of our being AST-based. Overall I think the name "WebAssembly" is good in that it suggests both "Web" which is high-level, view-source friendly, open, and so forth, and "Assembly" which is low-level, performance friendly and compiler friendly. We do want all of those things, both "Web" and "Assembly". In summary, as I said I am sympathetic to your concerns, but think that we are close to what you want. With one caveat, that we do not have a final text format yet, and clearly this high-level/low-level issue will play out there. I think efforts to ensure that WebAssembly ends up a good fit for the web are best focused on that. We've worked so far mostly on the core semantics, instead of either the binary format or the text format, but I do think most of us have been thinking more about the binary format while doing so. That trend worries me, and perhaps it is what concerns you as well? But, I think if we focus on making sure we get a good text format, everything will be fine. In concrete terms, I think it would be helpful to do more prototyping and discussing of text format options (I intend to do so in binaryen myself, when I can). |
+1 to what Alon said. I tend to agree that "Assembly" in the name is a bit misleading technically speaking. But it is established by now. And it is a cute play on slogans like "JavaScript is the assembly language of the web". |
Thank you for following up. My interpretation of the tone of discussions and decisions being made does not communicate a strong commitment by a core of the group to the AST. It was recently described as an 'implementation detail', and I got no support from this core of the group for clarifying the focus. Using the term 'AST' might well be confusing to someone just starting on WASM, particularly if they know some Assembly code, but it could be worded as 'the development of a structured source code with expressions' which would have communicated this, yet there was no support for this. I just don't think there is enough respect for this being a source code, and that this should be clearly articulated as a technical constraint and a goal. On 11/27/2015 02:58 PM, Alon Zakai wrote:
Yes, the technical difference is not too great (at present). Why not go all-in on this being a 'source code' rather than leaving any doubt.
It reads as an afterthought. It sounds like the 'binary format' is the driver, and could be a linear machine code, and the 'text format' would just follow. 'Isomorphic' is rather vague. Why not just go all-in and state that the binary format is a one-to-one encoding of the text source (comments and white space and labels and all), and make this a top-level constraint? Make it a constraint that a valid source text file can be encoded without loss. I think this would clear up a lot of trouble and would add only minor complexity to the binary encoding, and I would much rather be addressing some technical nits than messing with a high-level constraint or goal.
As noted above, I don't see this as being recognized as a hard constraint by a core of the group, and I suggest going further to make it a high level constraint.
This just does not appear to be articulated as a core constraint, and as noted above 'isomorphic' is vague, and it's a lossy encoding unless the source is in a rather restricted canonical form. A lossless encoding just makes things clear: clear for the developers now, clear for newcomers, and perhaps clear for the courts if it could be seen as equivalent to compressed source code (although it's been parsed too).
Yes, the AST was the saving grace when I saw what was being done, yet it would seem to be wishful thinking on my part, or at least to warrant articulating as a high-level constraint, and I think this can best be done by making it clear that a source code with expressions is being developed. Part of this should be a recognition of the source as not just a compilation target, and some conveniences for writing and reading the code should be accommodated too. Some of the arguments against my suggestion of adding block-local variables verged on the bizarre, such as suggestions that it would make interpreters less efficient (not able to allocate all locals on entry), yet the AST will require an interpreter to push and pop expression intermediate values anyway (unless there is an agenda to flatten the AST). The use case of the language being a compilation target is given too much prominence. If this were the sole use case then there might be technical merit in stripping the language bare of any and all unnecessary support - the AST would go and it would clearly be a virtual machine code.
Assemblers produce machine code. Recovering the Assembler code is disassembly. Converting it to structured code is (perhaps) reverse engineering. It's the wrong parallel. It didn't really matter for asm.js as there was always the constraint of having the JS source, and the context helps too.
Sounds good, thank you. |
It's true that whitespace and labels and so forth are not 100% preserved. However, I think that is a benefit for the text format and not a downside: it gives us more freedom to represent things. The binary format will be optimized for small download and fast parsing, and might use various encoding techniques which are not necessarily good for readability of a text format. The text format will still be translatable to and from it, but it might do some simple expansions/mergings to make it more readable. At least we have the option to do this, given the current text, while if it said "1-to-1 with every detail in the binary format" then we would not. If you feel that the benefits of being AST-based would make sense to be mentioned more prominently, then I would support that. But the question is where. In that other issue, we were discussing the title page, and I suggested we just copy the original WebAssembly overview from the W3C community page - because it's been there since the beginning, no one had issue with it, and why not be consistent with that. Perhaps there is another location, and I would be open to hear where. But I do feel that it is already mentioned prominently - it's in the very name of AstSemantics.md, and it's stated in TextFormat.md and in FAQ.md. In other words, I think the AST aspect is well-supported in the text. But I agree that source/text format concerns have not always been prominent so far. I think the way to fix that is to focus on them now with prototyping and discussion. |
On 11/28/2015 06:52 AM, Alon Zakai wrote:
I don't think the claimed 'benefit' is real, and I see 'downsides'. Can we explore this further? Could you give an example of a substantive 'benefit' to consider? Here's the downside: I believe there would be a huge benefit for the web in the encoded source being capable of encoding the text source one-to-one, including comments and any style convention variations and named labels etc. This would mean that web users could view-source the encoded source and see the annotated code. Not supporting this seems a very significant 'downside'. The other 'downside' to leaving this vague is that it leaves open an ongoing number of decisions that border on a high-level and visible area of conflict. Making a decision that the source is encoded one-to-one addresses all of these now and we can move on - the remaining challenges are technical matters of how this agreed constraint is addressed, these can be assessed on technical merit, and people are probably not so fussed about small differences in the technical solutions. Obviously it still leaves open what is valid source code, which leads into the matter of having expressions (an AST), supporting comments, and labels etc. |
Sure, a concrete example is that the binary format might have only a branch with a numeric label index. And on the other hand, the text format might have a higher-level loop construct. The point is that the binary and text formats have somewhat different goals: the binary format aims to be small and fast to parse, while the text format aims to be easy to read.
And those two goals, of "small and fast" and "easy to read", won't always agree. Hence I think it is useful for us to have some amount of freedom in the binary <-> text relationship. Not too much, obviously, but hopefully enough to let each format be more optimal for its particular goals. Or to put it another way, enough leeway so that details in one do not overly constrain the other - that benefits both of the formats. |
On 11/28/2015 09:10 AM, Alon Zakai wrote:
This use case can still be addressed when the encoded source is a one-to-one lossless encoding of the text source. When the text source also uses only a branch and a numeric index then it will compress just as well. Obviously there will be some overhead in the encoding to handle falling back gracefully, but I believe it will be very small, and in the limit it could be just one bit. For example, the one-bit-overhead solution: when the text source is in the most-compressible canonical style, the C bit is set and the file is encoded as it would be under a wasm binary encoding; when the text source is not in the most-compressible canonical style, the C bit is clear and the encoded source includes two blobs, the first being the wasm binary encoding and the second the compressed source text. The one-bit-overhead-per-function solution: as above, but per function or section. I am confident that in practice we could do much better and trade off a small overhead to support a graceful degradation of the encoded source file size when the text source code is not in the most-compressible canonical style. For example, adding support for named labels and variables etc. might just add a few extra opcodes or optional sections. So your example is right only in the extreme; given an acceptance of some reasonable overhead to accommodate readable annotated source, it is not a decisive example. Can you think of another example that does not depend on an extreme optimization for the encoded file size at the expense of all else?
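A rough, runnable sketch of the one-bit-overhead container described above - the field names, the toy canonical form (single-space separation) and the stand-in encoder are all assumptions for illustration, not a worked-out proposal:

```typescript
// Toy illustration of the "C bit" idea: canonical sources pay no extra bytes,
// while styled/annotated sources still round-trip exactly, at a graceful cost.
// A real design would use the actual wasm binary encoding and a real stream
// compressor in place of these stand-ins.
const te = new TextEncoder();
const td = new TextDecoder();

const canonicalize = (s: string) => s.trim().split(/\s+/).join(" ");
const isCanonical = (s: string) => canonicalize(s) === s;

interface EncodedModule {
  canonical: boolean;          // the "C bit"
  binary: Uint8Array;          // stands in for the wasm-style AST encoding
  compressedText?: Uint8Array; // carried only when the source was not canonical
}

function encode(source: string): EncodedModule {
  const binary = te.encode(canonicalize(source)); // always decodes to canonical text
  return isCanonical(source)
    ? { canonical: true, binary }
    : { canonical: false, binary, compressedText: te.encode(source) };
}

function decodeSource(m: EncodedModule): string {
  return m.canonical ? td.decode(m.binary) : td.decode(m.compressedText!);
}

console.log(decodeSource(encode("(block (br 0))")));       // canonical: C bit set, no overhead
console.log(decodeSource(encode("(block\n  (br 0)\n)")));  // styled: falls back, still lossless
```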
There seems to be a use case or goal to support a compressed encoding and fast parsing but I believe this can still be met while maintaining a one-to-one lossless source encoding. Surely there is also a use case for supporting well annotated source code. I hear people talking about a one way 'assembly' or 'compilation', or including 'debug' info, but this casts the deployed binary as an object file. I have a strong preference for a lossless source encoding rather than using one-way 'assembly' or 'compilation' solutions to address this use case. It makes the intent clear. It also avoids the need for separate text source files to view the annotated source. As another added benefit for the web, if the canonical style is pretty printed then for maximum compression source code would need to be deployed in the canonical pretty printed style and could be viewed in this pretty style. There would be a disincentive to minify or obfuscate code using style, namely larger encoded file sizes. |
How would that be a huge benefit? In the vast majority of cases, there won't be any comments, style conventions or named labels, since the WebAssembly binary code was compiled directly from another language. And to view the code in that other language (including comments etc.), you can use source maps. |
@JSStats:
Sorry if I wasn't clear, I was trying to make the opposite point. Yes, we could make the text format just as compressible, as you state. But it would lose clarity by doing so, since I think higher-level control flow constructs would be preferable to the majority of developers on the web. |
On 11/28/2015 10:54 AM, Petr Onderka wrote:
This use case, which sounds important to you, would not be adversely affected. Even for this use case it would help, because compilers can generate source code that includes annotations that can be very helpful when profiling or debugging. @kripken I worry that we are talking about different points. There is a difference between the canonical source text being 'just as compressible', and the text source being able to include annotations and labels etc. That is, the source text can be just as compressible, and also support extra annotations etc. for extra clarity. Whatever text format you come up with, which I presume will compress well, could be the canonical format, and then you have all the clarity you wanted - I don't see a difference on this point. It could be a choice for the source text producer whether they want to canonicalize their source for maximum compression or retain some degree of annotations, and the encoding might fall back gracefully in compression efficiency as the text source varies from the canonical style. On the matter of control flow, I have not seen a big issue here yet; it seems quite possible to interpret the blocks/loop/br as high-level constructs with a little pattern matching, and where there are multiple interpretations (which could lead to some information loss) some extra opcodes might be needed to distinguish them. I am not proposing that the canonical source need be blocks/loop/br, or even that it should use numbers for relative labels, and certainly not that the text source language should expose all the binary encoding details. |
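As a rough illustration of that pattern matching, a minimal sketch; the node shapes here are hypothetical, not the actual wasm AST, and the printed form is only one possible text rendering:

```typescript
// Hypothetical AST shapes, invented for illustration; the real wasm AST differs.
type Node =
  | { kind: "loop"; body: Node[] }
  | { kind: "br_if"; depth: number; cond: string }
  | { kind: "stmt"; text: string };

function printNode(n: Node): string {
  switch (n.kind) {
    case "stmt": return n.text;
    case "br_if": return `br_if ${n.depth} (${n.cond})`;
    case "loop": return `loop { ${n.body.map(printNode).join("; ")} }`;
  }
}

// Recognize the common shape `loop { ...; br_if 0 (cond) }` (a back edge to
// the loop header) and print it as a structured do/while; anything that does
// not match falls back to the flat form, so nothing is lost either way.
function printStructured(n: Node): string {
  if (n.kind === "loop") {
    const last = n.body[n.body.length - 1];
    if (last && last.kind === "br_if" && last.depth === 0) {
      const body = n.body.slice(0, -1).map(printNode).join("; ");
      return `do { ${body} } while (${last.cond})`;
    }
  }
  return printNode(n);
}

console.log(printStructured({
  kind: "loop",
  body: [{ kind: "stmt", text: "i = i + 1" },
         { kind: "br_if", depth: 0, cond: "i < 10" }],
}));
// -> do { i = i + 1 } while (i < 10)
```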
It isn't obvious to me how to create a text format that is both maximally compressible but also has the option for sufficient extra annotations and clarity. But perhaps that is just me not seeing a solution to this that you already do. Concrete proposals for a text format that can do both would be great, of course, that's exactly what we need now. |
|
I think 1 and 2 are exactly what we should be prototyping and experimenting with now. I have hopes that we can achieve them well, but also worries. 3 is an interesting proposal. Let me try to restate it, to see if I understand:
Do I understand you correctly? |
@kripken No, source maps cast the deployed file as a binary object file, with a clearly separate and optional source file. The view that 'we see the wasm text format as a language that we compile into the wasm binary format' is part of the problem - a very unfortunate outcome. The 'Assembly' naming implies this interpretation: an assembly pass converting the source to a binary machine code. This issue is about moving away from this view, to a clear source code story for the web, and having the Specification and the definitions within it reflect this, and the group's discussions and language reflect this, as a matter of professional good faith. Source maps can still be supported for mapping back to other languages that have been source-to-source translated, but they are a separate matter. Obviously, to have a clear source code story for the web the MVP needs to be source code too, and the polyfill could ignore many of the annotations (such as comments) when the size of the emitted JS is an issue, or map them back to JS as an option if people want the code to be as readable as possible - there will be some loss converting between languages, but many elements could be translated, even comments. This is not something that can be feature-detected in future versions - it needs to be built in from the start. I am making the case for the source code story to be a core goal and a constraint on the technical outcome, not an afterthought. |
What I wrote was "if we see the wasm text format as a language that we compile into the wasm binary format." In other words, I tried to present it that way in hopes of clarifying things.
The MVP has both a binary and a text format. You don't consider that text format sufficient to count as a "source code", and I surmised that it was because it lacks whitespace and labels. I therefore suggested that a concrete way to implement what you suggest could be to add those things to the MVP text format. But it sounds like I have not properly understood you yet - what more would be required, concretely? |
@kripken Simply: the deployed blob should be a source code, or a lossless source code encoding, and consideration should be given to it being readable and writeable source code. Adding comments and labels to the 'text format' would help address the second part, 'consideration should be given to it being readable and writeable source code', but if these are not encoded in the deployed blob then it casts the blob deployed to the web as a binary code and not a first-class source code, which I believe would be very unfortunate for the many reasons articulated above. The MVP does not currently have a deployable text format; rather it follows the model you noted of a text format that is compiled to a binary format with presumed loss of information. The first step is for the group to decide that the deployed blob is a lossless source code encoding. Then the language in the Specification and the name of the group can be updated to communicate this clearly, and the technical solution updated to meet this constraint. |
I would put it this way: We have a semantics, and we will have a binary format and a text format. Both will represent those semantics, and both can be converted to the other. Both are important. They are currently not intended to be converted to each other without loss. Lossless source encoding is an interesting proposal, which could perhaps be added to the current model. I can see it adding value, but also complexity, and also it has downsides as noted before. I don't yet have an opinion on it myself. I think we could debate it more at length with a concrete pull request with that addition to the design docs, because this issue here - of changing the name and focus - is far more broad and general. |
It seems to me that getting the semantics right is far more critical than the encoding of the AST. There are valid arguments for having a one-way encoding of the intentions of the programmer. A lot of effort is spent currently in obfuscating JavaScript. |
If I understand this proposal right, I don't see any real-world benefits to having a lossless encoding. Developers will want to view-source and debug code in the original (highest-level) language. Another point is obfuscation, as @fmccabe mentioned - I believe it's a good thing to have as an option. |
@fmccabe There is certainly a lot of work in specifying the semantics, and it is very, very difficult to get agreement across vendors. Just stating that it will have an AST is very vague - it could just mean a linear byte code in the extreme, and if optimizing purely for the 'compilation target' use case or the 'fast interpreter' use case then dropping the AST may well be a logical and optimal solution. So what goal or principle holds the AST in place? It might be possible to argue some advantages for parsing to SSA, or for encoding efficiency. The principle I propose is that some consideration be given to the source code being readable and writeable. @fmccabe @mkazlauskas Good luck trying to sell 'obfuscation' to the web community as a well-supported use case of this group. You should articulate this point in your appeal to the group to adopt a 'one-way' encoding - could you please develop this argument? I hope the web community will support me in rejecting it. I would note that if the deployed blob is a lossless source encoding then you can still obfuscate as you wish, but doing so in ways that are outside the canonical (readable and writeable) styles will not encode as efficiently. Conversely, adding annotations will also increase the encoded blob size. I would like the group and the web community to settle this point, and for the result to be articulated clearly. Let's settle it. @mkazlauskas If the language being developed is to be a first-class source code then there will be developers writing and reading it and viewing the source code in text format, and if the deployed blob encodes this source code without loss then they will be well supported too. The use case of viewing the source of translated code will obviously still be well supported. |
@JSStats About obfuscation. In fact, I have no need to 'sell' obfuscation. Whether this group supports it or not, there are legitimate reasons why publishers want it, especially in a world where software patents are looking increasingly 'hard' to get. IMO, one of the key motivations for native apps vs web apps for publishers on mobile devices is exactly the ability to deploy an application without having the world's hackers being able to pick it apart. (Of course, you can disassemble a compiled C++ program, but it is expensive.) As far as the AST is concerned, it is a relatively simple (IMO) part of the overall enterprise. Definitely, having a standard text representation is very helpful for developers. Again, a structured AST - as opposed to an SSA format - simplifies some of the analysis in interpretation. However, there is a risk with it: the temptation to lift the language to something closer to what regular programmers might program in. The problem with that is that it is impossible to meet the natural requirements of all programming languages. For example, my interest is in languages like Prolog, Haskell and ML (and others). I personally could not care less about C++ or Java (except professionally). Someone from the latter community would have a hard time designing a low-level language that can handle C++ and Haskell equally well. However, they can both be compiled very efficiently to bare metal. |
Moved from the public mailing list: On 11/27/2015 06:41 PM, Jeff Lewis wrote:
You may be right, and I see this in discussions and decisions, and I believe this was the direction articulated to me over a year ago. And yet there is mention of a source format and an AST. It needs clarification, and it needs to be communicated clearly to group members. 'More modern and sophisticated development' seems very subjective, and there is nothing modern about a one-way compilation to a machine code. The one-way compilation solutions seem to have all been failures on the web, and perhaps this is the problem and one not to repeat again! Perhaps we are close to something more 'modern' - a pre-parsed and compressed source code.
There have been p-codes etc. dating a long way back, at least as far back as the early BASIC interpreters, but even these supported comments. My understanding is that .NET does not encode comments or source code style and is a lossy encoding, so the deployed blob is not source code.
This is certainly not the intention, and the pipeline will not require the text source format. A pre-parsed and compressed source code will be deployed, not significantly larger than a byte code blob would be assuming the text source has been stripped. Making it a first class source code is to support writing and reading and viewing the source, which will require decoding, and to avoid parallels with machine code and disassembly and reverse engineering etc. A text source viewer could also be smart enough to decode the source one function at a time, and so decode incrementally.
So the proposed lossless pre-parsed compressed source code could meet your use cases too, if it had similar efficiency? |
On 11/30/2015 12:21 PM, Frank McCabe wrote:
That is an interesting point, can you substantiate it? What are the primitives you need exposed for these languages that are not supported by the AST, and not supported by a (at least isomorphic) text format? What constraints do your use cases place on the design? |
Both Prolog and Haskell pose challenges to the current design. In some ways they also have a common requirement: a non-standard evaluation model resulting in a need for more control over the representation of evaluation. Prolog has an evaluation 'stack' (quotes required) with two features not found in normal languages. The first is non-local return: when a Prolog program ends, the point it returns to is not necessarily near the top of the stack; however, that stack must still be preserved in order to support backtracking. The second feature is backtracking itself. What that means is that there are two separate ways in which a program can return: successfully or unsuccessfully. In general, a good Prolog implementation needs a lot more explicit control of its evaluation stack than languages like Java/C++ do. Haskell is a different case again. Its implementation has a number of features that are very foreign to conventional languages. In the first case, arguments are not evaluated prior to entry to functions. The effect of this is that all data looks like code. It also means that the normal array mapping of an evaluation stack is not efficient for Haskell. In the second, there can be multiple return points to a function call: one where the result is represented as a single value, and one or more where the return result is 'postponed' with the components of the return value spread across multiple registers. Both Prolog and Haskell require much more control over the representation of the evaluation machinery than is proposed here. This is, of course, glossing over the absolute requirement for full tail recursion optimization (the proposed mechanisms scheduled for after the MVP are not actually strong enough). One additional remark: the kind of structures needed to support Haskell and Prolog are also very good for supporting so-called asynchronous programming. So even JavaScript and C++ could benefit from these techniques. |
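A tiny sketch of the 'two separate ways to return' point, using explicit success and failure continuations - illustrative only; real Prolog implementations manage these stacks far more carefully than this:

```typescript
// Each goal takes a success continuation (which receives a "retry" thunk for
// backtracking into this goal) and a failure continuation, so a goal has two
// distinct ways to return - something a plain call/return stack does not model.
type Fail = () => void;
type Succeed = (retry: Fail) => void;
type Goal = (succeed: Succeed, fail: Fail) => void;

// member(x, xs): succeeds once per matching element, retrying on backtrack.
function member<T>(x: T, xs: T[]): Goal {
  return (succeed, fail) => {
    const tryFrom = (i: number): void => {
      if (i >= xs.length) return fail();                     // unsuccessful return
      if (xs[i] === x) return succeed(() => tryFrom(i + 1)); // successful return
      tryFrom(i + 1);
    };
    tryFrom(0);
  };
}

// Conjunction: run g1, and on each success run g2; if g2 fails, backtrack
// into g1 by invoking the retry continuation g1 handed us.
function and(g1: Goal, g2: Goal): Goal {
  return (succeed, fail) => g1((retry1) => g2(succeed, retry1), fail);
}

// ?- member(2, [1, 2, 3]), member(2, [2]).
and(member(2, [1, 2, 3]), member(2, [2]))(
  () => console.log("yes"),
  () => console.log("no"),
);
```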
I'm still trying to grasp the problem that's trying to be solved here. It sounds like you're proposing that source code obfuscation should be disallowed. Inventing a new human readable/writeable language sounds like unnecessary abstraction, limitation and complication for this project. I'd suggest creating a new project for the proposed language that compiles to WASM and embeds the compressed "source code" into the generated elf. If the web thinks it's important to have non-obfuscated viewable source-code, then they will adopt the project. |
I'll respond to several comments here... I think you're wrong on several counts - but I think it's also because you're looking at it the wrong way.

First, let me clarify something - the main point I was trying to make is that any bytecode-based system will be better than just raw interpretation of JavaScript. I suggested .Net solely because it exists - it has 15 years of experience and support - and is open source. I also suggested looking at the Java bytecode - or even rolling a new one. The important point was 'don't discard the advantages of a dynamic-compilation-based system that compiles to a bytecode machine'.

Second, there isn't a solution for JavaScript on .NET that can get close to the performance of modern JS engines in the browser space because no one has written one. There's never been a need for it. The closest thing would be Silverlight, except it was targeted as a Flash replacement, not a JavaScript replacement. However, Microsoft has been pitching JS/HTML as an alternative to C#/XAML for application construction for almost four years now (much to the chagrin of C# developers) and they have gotten good performance out of it. A browser would NOT have to have both JS and .Net VMs. It would need a .Net VM with the appropriate Roslyn front end to do real-time compiling of JS.

But more to the point - to me, the idea shouldn't be simply slapping in a new JS engine to interpret JS in some other way or, worse, translate it back to regular JS - what's the point of that? The point should be to REPLACE JS with a more productive, safer, modern system that allows for a new JS as well as other languages that can be precompiled to allow smaller payloads with more object safety and interoperability, while providing backwards compatibility through language support and dynamic compiling in the browser, running that code in a sandbox to limit the damage it can do.

As for the overhead - .Net and Java bytecode run on IoT devices far, far smaller than even the smallest browser. Heck, even Silverlight - which implemented a subset of WPF - was just 4MB as a plugin, and this would be integrated as a W3C standard component, so it wouldn't even be a download. You might want to take a look at https://www.ghielectronics.com/technologies/netmf to get a sense of just how small .Net can be. There are similar implementations for Java bytecode. At the same time, they're extensible... you can cache binary bytecode libraries as users hit them, and they're typically small as well - so to me, this is a false concern.

I'm not sure who asked the question about how .Net handles comments - because the move to this forum 'obfuscated' the original owner's name... .Net doesn't compile in comments directly, but there is a set of special comments that get compiled into the assembly and allow for integrated documentation at the module, class and member levels. It would be possible to reverse-compile them out if desired - or strip them if you want to obfuscate a compiled module.

Now, onto abustin's comment. You've actually nailed a core part of my concern: compressed source code isn't a good way to handle this. It binds the design to a specific language that's meant to preserve readability in a kind of hamfisted way. It also has the effect of constraining any other language you wish to use with the system. That was a key difference between Java and .Net. Java bytecode was tightly coupled to Java.
When people wanted to use other languages in a mixed environment, they ended up translating to Java and then compiling that to bytecode - which emits really inefficient code. With .Net, they took the opposite approach and designed a 'best case' virtual CPU that was language-agnostic and then wrote the languages to fit in. That resulted in literally over fifty different languages, from C# and C++ to Smalltalk (S#), Pascal.Net and APL of all things - all compiling to the same IL and all interoperable. Even better, you could decompile from one to the other - compile in C# and decompile in Pascal.Net and keep working with it. It seems to me that this is a much more flexible solution to the 'viewable source code' problem than trying to compress or minify source code, or even semi-tokenizing the source code into a human-readable stream. Cheers, |
On 11/30/2015 09:24 PM, JeffLewisWA wrote:
This is not a valid point, because the proposal being developed is not 'interpretation of JavaScript', nor even interpretation of a primitive source code language. The blob being deployed is pre-parsed and compressed and optimized for file size, loading and compilation.
Not sure I understand this point. Some support for making the language readable and writeable is orthogonal to the other differences from JS. The solution being developed here is not expected to reach native performance in general, due to the overhead of the sandbox. For example, a native language could verify the values stored into an object slot, then know the value range when read and use this to optimize further - this is not possible using only the linear memory. Security through reachability is not possible either, etc.
This is not the language being developed here. I don't even think being a base for a JS front end is a use case being considered.
Java was initially targeting small embedded devices. It was re-marketed to salvage something. Many people were quite disappointed by it at the time - it's nothing special. The language being developed here is also expected to support lightweight consumers, and probably much lighter ones, as there is no object support and no runtime library baggage.
I noted that .Net does not encode comments - interesting to hear it might. I did email you in private when moving the discussion here, and the message did include my name. Sorry if this was not clear.
There is no difference between the proposed lossless pre-parsed compressed source code and the current lossy pre-parsed binary encoded code from the point of view of 'binds the design to a specific language' or 'constraining any other language you wish to use with the system'. It just allows the deployed code to be a readable and writeable source code.
.Net (CLI) is a stack-based linear virtual machine code, far from the conveniences of a readable and writeable source code. I don't want web users having to 'decompile' to view the source code.
With .Net a byte code is deployed - it's not a source code. I largely support the model being developed here - the linear memory, the primitive operations, the lack of object support, etc. - but I would like the deployed code blob to be a clear source code, and I really don't think it compromises performance or deployed code size, or substantively changes the AST. Regards |
@JSStats I'm not sure how Native Client has failed, other than Google maybe not having any new announcements on it anymore. I've seen negative feedback on it, but the sources and nature of that feedback are questionable. If Microsoft has invested in this approach and has vetted it enough to make it the backbone for Windows 10 Universal Apps, it's possible they've understood how to overcome some of the shortcomings people are mentioning about Native Client. But let's think about this seriously. Java and .NET VM/JIT technologies have existed for a really long time. Why are we seeing changes to a more native approach? Is it possible that all of the smart people working on comparable technologies at Microsoft and Google have made the same revelation? Which is that they will never achieve native performance without compiling to a native output? By not going with such an approach, do we automatically give up the ability to have native performance? Or do we make it extremely difficult for ourselves to achieve such performance? The main thing that stands out to me in the high-level goals is the first item in the list mentioning "native speeds". Is this something everyone is serious about? @kripken Can you point me in the direction that mentions these philosophies? I've read a lot of the documentation on here; what seems closest is the high-level goals. While the high-level goals document is very helpful, it seems to be more of a feature-level set of business requirements. A list/matrix of the different approaches, as well as their pros and cons, would be extremely helpful for everyone to fully understand what's being done and how to collaborate better. Again, if I'm just not seeing this, please point me in that direction. |
Calling NACL a failure is somewhat hyperbolic. It's a proven, secure solution for deploying apps written in C, etc. It's the vendor adoption that's held it back and made it a perceived failure. WASM (IIRC) is the compromise between asm.js and NACL that all vendors are aligned around. I'd personally prefer mass adoption of NACL or something similar. I also find the idea that a binary format will cause the end of the web overblown. Why must a text format prevail in order for the web to keep functioning? W3C literature seems to say to use the best encoding for the situation: http://www.w3.org/People/Bos/DesignGuide/binary-or-text.html. I deal with many low-end devices that spend a lot of time in the "parser" phase of plain text code. I look to WASM to remove that unnecessary parser step. |
@JeffScherrer see also the FAQ, which mentions PNaCl and LLVM IR, as alternative approaches that were considered. HighLevelGoals mentions portability as a core concern, which rules out the NaCl approach of building to each architecture (PNaCl also dropped that part of NaCl). There is also Rationale, but it's more specific. I agree it might make sense to write up a comparison against .NET and the JVM, those specifically don't seem to be written up. Briefly,
|
@JeffScherrer @abustin Yes, NACL might have failed in part because it could not get multi-vendor support, sorry. But why the resistance from other vendors? Might it have been that they have different security models and implementations and thus need something a little more abstract? That they want to keep their options open for future low-level changes and improvements that would be difficult with NACL? A key advantage I see in the asm.js/wasm approach is that it adds an extra step, giving a little more flexibility here. @abustin The proposal is still to deploy a pre-parsed compressed binary blob that is quick to parse and compile, just one that is also a lossless source code encoding. This issue is a little more subtle than text versus binary. I am not claiming a binary deployment of stripped code 'will cause the end of the web'; rather, I just don't think it is the best outcome, and it seems entirely unnecessary technically. I believe the onus should be on those who insist on stripping the deployment language to justify why this is necessary and why this extreme position is warranted. |
@JSStats
This is a sensible proposal with obvious advantages, but the implied costs are significant. Extensive research into compression ratio/decode speed characteristics has occurred already - and this is just the public wasm-focused work. We're leveraging an existing body of research into binary encodings for native instruction sets, trees, and other forms of data. We're also leaning heavily on knowledge about modern stream compressors and about how web applications are shipped and run on existing devices. Applying all this knowledge to text parsers and textual languages is just hard. People have achieved incredible gains in size & decode speed for existing text languages, but the fact is that it is difficult to make text parsing as fast as a binary decoder. Even a poorly designed, lazily implemented binary decoder can outperform a clever text decoder in terms of speed and heap usage - both of which are important considerations for the web, especially on resource-constrained platforms like mobile. The existing body of research does not support the idea that it would be trivial to ship an efficient representation for decoding/compilation that also retains all the source-level information such that you can round-trip things like comments & variable names. One could build that format, but you wouldn't deploy it any more than people deploy large blobs of hand-written JS in production. It is certainly the case that some executable formats (e.g. Java class files or .NET assemblies) retain more information than others, but I'm not aware of any deployed real-world example of your proposal. A file format cannot truly be all things to all people. At present, wasm is focused on solving some specific problems, and adding additional major objectives like round-tripping of arbitrary source risks compromising the most important objectives. |
It might be worth adding Dart VM to this comparison. It's pretty much designed for the web and already has solutions to many problems (e.g. snapshots). |
On 12/01/2015 07:23 PM, Katelyn Gadd wrote:
I really am interested in the work you have done in this area and in the binary encoding to be developed, but I seem to be failing to communicate a key point here: I am not proposing to deploy a text source code, so the points you make against that are just not relevant.
It seems obvious to me that this could have a low burden. I am sorry if I have been unable to communicate this. Perhaps we can revisit it as the binary format is developed and I can demonstrate how it can support annotations and labels with a small cost to the encoding. Has anyone in the group been able to follow this technical point and thus be able to support it? For example, that adding optional function header comments, and adding opcodes for statement level comments and line end comments, and supporting named labels, would not be a significant burden to the encoding and would not significantly increase the blob size if not used. |
@JSStats I think I may have been confused by the emphasis on textual formats and other details. The specific thing you call out - optional opcodes for metadata like comments - is definitely possible and would fit naturally into the wasm extensibility model. There are some details that would need to be addressed, like how to ensure that the metadata opcodes do not break decoding in implementations that don't understand them. That might be as simple as using the polyfilling mechanisms to define them as a no-op. Encoding metadata as opcodes is something that would work in both textual and binary representations of the format, as it's an AST consideration. The above is still distinct from taking textual sexpr comments and round-tripping them through a binary encoding, however. I was under the impression that you are interested in round-tripping the textual representation with complete fidelity, including whitespace, label/symbol names, and comments. |
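A rough sketch of how such optional metadata opcodes might be skipped by consumers that do not understand or want them - the opcode value, the length-prefixed payload and the helper names are invented for illustration and are not part of any wasm encoding:

```typescript
// Invented layout for illustration: a comment opcode is followed by a
// one-byte length prefix and a UTF-8 payload, so a decoder that does not
// care about comments can skip the payload without understanding it, while
// a text printer can surface it as a comment.
const OP_COMMENT = 0xf0; // hypothetical "statement-level comment" opcode

function readModule(bytes: Uint8Array, wantComments: boolean): string[] {
  const out: string[] = [];
  const dec = new TextDecoder();
  let i = 0;
  while (i < bytes.length) {
    const op = bytes[i++];
    if (op === OP_COMMENT) {
      const len = bytes[i++];                   // toy one-byte length prefix
      if (wantComments) out.push(";; " + dec.decode(bytes.subarray(i, i + len)));
      i += len;                                 // otherwise treat it as a no-op
    } else {
      out.push("op 0x" + op.toString(16));      // stand-in for real decoding
    }
  }
  return out;
}

// Example: one ordinary opcode, one comment, one more opcode.
const payload = new TextEncoder().encode("loop counter");
const blob = Uint8Array.from([0x20, OP_COMMENT, payload.length, ...payload, 0x0b]);
console.log(readModule(blob, true));  // includes ";; loop counter"
console.log(readModule(blob, false)); // same opcodes, comment skipped
```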
It may be sophistry on my part, but a textual format is also always a binary format, so the problem still remains the same: what binary format should we choose? For that reason, I believe the name WebAssembly to be better suited to describe the true nature of this endeavour than WebScript or WebSource. I like text, but what's the point of having a binary format if the browser still has to parse, optimize and JIT-compile the code? The binary format should allow the parsing and optimizing phases to be moved out of the client, leaving only the JIT-compiling to be done before running the code (i.e. make the client do as little as possible). Adding optional source annotations could be a plus for people who like to use their browser as a debugger, but let's realize that the vast majority of clients just won't use that feature (when's the last time your mother clicked "view source" when a problem came up on a site?). Plus, the obfuscators out there could just obfuscate their code before compiling it anyway, so a "clean source code story" would be dependent on everyone's good will, which is not really an improvement over the current system. If, despite all that, you still really want a canonical source format that can be translated to different viewing conventions, while adapting to every language feature out there, then what you want is a Lisp. It was designed to be as syntax-agnostic as can be, while still being directly mappable to any semantics thanks to its macros. All you need is a list of frontend parsers for all supported languages and you're good to go. Personally, I'd rather have a browsable binary format that's portable and read-efficient, even if it isn't editable like a textual representation. Text is too sensitive to small details like whitespace conventions, syntactic sugar and encoding issues to provide a solid base for a universal standard, IMHO. PS: I am in the process of writing a compiler that uses such a format for its object files, so it is definitely not impossible to design :-D |
@kg Great, perhaps I see a little light in the response. I think it can be extended to give the lossless encoding, but at this point it's too hard making the case so I'll defer this until it can be demonstrated. Bring on the binary encoding, and @kripken's text format. @lih I think the key is a lossless translation. Compilation and assembly are generally lossy. This issue is only addressing the deployed code language - developers are still free to strip their code but I just don't want this to be the only option. |
@JSStats in that case, let's make source annotations optional at compile-time. I agree that the source should be accessible at all times (free software FTW), but since most clients will never need it, it should be kept separate from the binary for performance. In cases where you need source information, we could require that the stripped binary offer a link to the annotated version of the same program or a source map, so that the browser could seamlessly switch to "debug mode" if asked. |
@lih This issue only addresses the deployment language, and I don't want to see an 'assembly' or 'compilation' from a deployment text format to the deployment binary blob, and separating it into a binary and source would make the binary a lossy version of the source. Developers might as well just have two deployment blobs, one with annotations and one stripped if they use both to keep things simple. This issue is not about the use case in which the deployment language is a compilation target, and in that case there are planned to be source maps and the source will be separate from the deployment language - but this would be a separate matter for discussion in another issue. |
@JSStats There's bound to be some compilation phase on the client side to obtain a native binary, and I don't really see how compiling text is less objectionable than compiling a pre-parsed binary. Reading back, you seem to equate "binary" with "imperative and unstructured", which could explain our misunderstanding. I'm not suggesting a binary format containing some hard-to-analyze sequence of pseudo-instructions, but rather a structured binary format that describes an annotated optimized AST of the source, along with other metadata. If I understand correctly, you'd like developers to be able to write in the deployment format directly by hand, so it has to be text. I like the immediacy of scripting in the native language of my environment, but there's no reason that the server couldn't recompile the source every time it's modified before sending the resulting blob when clients hit "refresh". That way, the immediacy would be intact, and everyone could use the textual format they prefer. Whatever the format though, if you allow optimizations to be performed on the server side at compile time, then even though you could theoretically turn an optimized program back into textual form, it would be a garbled version of the original, and mostly useless to humans. Essentially, bidirectional translation between a blob and a readable source disallows any sort of "macro expansion" between the original source and the deployment format. That's why I suggested that blobs should provide a link to their original source instead. That way, you can optimize the program and still understand it if necessary. PS: you're right, the binary would contain less information than the source, but since it would also contain a link to said source, then no information would be lost, and the compilation process wouldn't be "lossy". |
On 12/02/2015 04:59 AM, Marc Coiffier wrote:
The 'compilation phase' is downstream of the deployment language. This issue is only about the deployment language. We both seem to agree that the deployed code will be 'an annotated optimized AST of the source, along with other metadata.' There is still the matter of what principle holds the AST in place, particularly if it were found to be non-optimal for some technical challenge. Just stating that there is an AST is very vague, and in the limit this could mean a linear machine code. The principle I propose is that some consideration be given to it being readable and writeable. Do you have some suggestions?
No, I suggest a text source code format and an encoded source code format that can be translated without loss. The encoded source code would be deployed. I expect machine producers will encode directly to the encoded format when efficiency is important. I expect developers to use a text editor to read and write the source code which would be translated to and from the encoded source code blob without loss. There is no need for the deployed blob to be a lossy stripped version of the text source format. Do you have a reasonable reason why this must be the case?
This is not the model proposed. It is true that a goal of wasm is to offload some of the optimization to the producer but this is reflected in the primitive operators and computation model. I don't want to see a deployment text source code that is 'server-side optimized' into the deployed blob - any of these optimizations can be upstream operations that can target a deployed language.
No, this issue is only about the deployment language. It has nothing to say about macro expansion between some upstream language and the deployment language. I worry this point has not been received.
This casts the deployed blob as not being a first class source code. It fails to degrade gracefully in size with the amount of annotation. I don't believe this is technically necessary. What reason is there to enforce this model on the web? |
Some thoughts:
|
On 12/04/2015 06:27 AM, Jon wrote:
While this text source code is intended to be 'isomorphic' to the deployed encoded code, it does not preserve comments or named labels or white space. It is a workflow model in which a text assembly source is assembled to a binary blob for deployment. Wouldn't it be nice to be able to deploy the source code, with annotations if you chose to, rather than only being able to deploy a stripped binary? Wouldn't it be nice to be able to view the source code as deployed without first clarifying whether 'disassembly' or 'reverse engineering' back to the text source code were legally allowed? There are no flags on the binary to indicate permitted uses, and no support for comments in which to include a license - you would have to contact the author via a separate channel. Wouldn't it be nice if, as a distributor of software that can view the source in text format, you had a little less to worry about on the legal side, because the encoded source was a plain lossless encoding of the text source code with no parallel to 'disassembly' or 'reverse engineering'? For education purposes, or productivity, wouldn't it be nice to be able to distribute a single runnable encoded source file and allow students to view the text source code with explanatory annotations, without separately distributing a text source file and explaining the workflow needed to generate runnable code?
This issue is not about stopping developers from stripping their code before deployment - it has nothing to say about upstream text source code generation - rather it is about not making this the only option. I see a small disincentive to using minification to obfuscate, because the AST does not really care about white space - that's about it. |
@JSStats
Technically, there are two phases to compilation: building and linking. Seeing as the deployment language is what allows the two to be separate, I'd say compilation is part of the problem.
If a technical challenge arises that can't be well expressed as an AST, what would its source look like (since it's also an AST)?
It could, but textual source code can equally represent an unstructured series of instructions. There are degenerate cases in all tree-like representations; that doesn't mean they are vague.
I do. Making a binary format readable is a simple matter of finding the appropriate visual representation. Making it writable requires writing a "syntax editor" to be used instead of text editors (something like Epsilon, an old proof-of-concept of mine).
Why do you expect developpers to translate from the encoded source blob when they already have the source ? It seems much more likely that they will indeed read and write the source, but only write to the encoded blob.
Simplicity and efficiency, if nothing else. Since the vast majority of clients are not going to read the source, shipping it all the time would be a huge waste of space. Additionally, we can design a binary format that is easy to parse in order to reduce the clients' workload when they need to run it (offloading some of the strain on the server instead). Baking in a link to the source (as I suggested) in every binary wouldn't waste so much space and would enable every good thing about the source to be at hand when it is needed. I guess what I'm saying is : if you send something to a client, don't presume that they need everything you can offer. Most of the time, they just need to run the app, so just send the the app.
What do you mean ? If your optimizations target a deployed language, in what sense are they not "server-side" ? And if there is macro expansion (and/or optimization) between the upstream language and the deployment language, then how can you rebuild the original (pre-macro-expansion) upstream source from the deployed blob ?
Who said anything about only allowing stripped binary deployments? Quite the contrary, we seem to be in favor of choice: binary for the most frequent cases, and source deployments for the few curious souls. Source-only deployment is what we already have with JavaScript, why make the same mistake?
Disassembly and reverse engineering are not a problem, nor should they be thought of as a crime. They are merely ways to understand how a program works. What people do with that information, that is where the crimes are, and giving away the source won't stop them from misusing that information if they truly want to.
Seems more like a matter of tooling than standards. Give students the source (with explanatory annotations), give them a tool that compiles and runs it, and they won't need a workflow at first. When they are more experienced and start to write multiple modules, teach them how separate compilation works and teach them the full workflow. By that time, they won't mind the small extra complexity.
^^ indeed, that is a great syntax. I wonder where it comes from...
Good question. If the binary format is equivalent to the source, then such a machine would make the source inefficient as well, which would prompt some rewrites, I think.
It doesn't, because nothing can. If I want to garble my source before sending it to the compiler, or the browser or whatever, then pretty much nobody can stop me (including file formats). |
On 12/04/2015 10:39 PM, Marc Coiffier wrote:
Again, this issue has nothing to say about upstream processing. The 'compilation' you refer to is an upstream source-to-deployment-language transform; the consumer compiles the code in the deployment language, or interprets it. This issue is in support of people hand-writing code in the deployment language, wanting to include annotations in a translated source, etc. The single-minded view that the only relevant use case is an upstream 'compilation' to a deployed binary is part of the problem. That use case can still be well supported, but this issue is about supporting another use case in which the deployed code is readable and writable source code.
Sure; for example, an interpreter might be better off with a linear byte code, but would that justify flattening the AST?
Yes, and I don't want the text source code to be a linear Assembly code either.
Text editors are easy and portable. Visual programming can be interesting too, but do many people edit JS or C++ or any Assembly code in a visual structure editor? I see an opportunity to support a range of editors with annotations. There could be function header comments, statement-level comments, line-end comments, etc. that could all map to a visual editor too. It might just be white space differences that would not map between a visual editor and a text editor, and those could probably be safely ignored.
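As a rough sketch of the annotation placements being described, assuming an s-expression text style along the lines of the current prototypes (the function and names are invented for illustration, and the comment syntax is not settled):

```
;; Function header comment: describes the purpose of the whole function.
(func $scale (param $x i32) (param $factor i32) (result i32)
  ;; Statement-level comment: explains the expression that follows.
  (i32.mul
    (get_local $x)
    (get_local $factor)))   ;; line-end comment on a sub-expression
```

Each of these positions has a natural anchor point in the AST, so a visual editor could attach the same annotations to the corresponding nodes.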
Because developer A will deploy code to the web in encoded format and developer B might want to study it without wanting to separately obtain the text source code.
While this is true, it is only true when taking an extreme position on 'simplicity and efficiency'. Adding a few opcodes is not going to move the dial on simplicity or efficiency. The majority of clients will be web browsers, which I do hope will be capable of 'reading the source', as in having a view-source option. If a developer wants to include annotations that increase the file size then that should be a matter for the developer; they will know their target audience best. The proposal does not compromise being 'easy to parse in order to reduce the clients' workload'. You are welcome to propose a separate link to the text source code in another issue - it does not address this issue. You can have your use case, but let users have a first-class source code too.
Making the deployed blob a lossless source encoding does not force anyone to deploy non-stripped code - it can be stripped before encoding and the deployed blob is still a lossless encoding. In contrast, the current design would force everyone on the web to strip their deployment code - it presumes that no one will want to deploy annotated source code.
There should be no 'server side' optimizations between the deployment text format and the deployment binary, apart from encoding matters. If there were some then they could be moved upstream of the text source code.
This issue has nothing to say about upstream production. These are all upstream matters. There is talk of source maps to address these issues.
The current design is one of 'stripped binary deployment'! There is no talk of text source deployments. This issue is not even about support for that, it is about making the deployed blob a lossless encoding of the text source code. The proposal is not analogous to text source JS deployment.
Check a commercial license. There are typically restrictions on disassembly and reverse engineering.
This does not justify forcing everyone to use this model, or forcing everyone to deploy stripped binaries. You have made some claims about 'simplicity and efficiency' and I dispute them unless an extreme position is taken on these. If I can demonstrate that the file size overhead is less than 1% then would you change your view? |
@lih Side-note, interesting about Epsilon:
👍 Been thinking about this with languages that can express concurrency in 3D. |
@jbondc @JSStats If I should be able to retrieve only the original source, how does a source map not solve this issue? If you want to design a representation that can losslessly translate between any two languages, I'm afraid that can't be done. If you want a common canonical representation, the Lisp-like representation given by jbondc fits that bill just fine, with a few extensions allowing for comments to be included. |
@lih I do worry that many people do not understand the issue. Let's say you write in Haskell and are translating to native wasm deployment code (not interpreting Haskell); you end up with a deployment blob. This deployment blob has a text format representation too, one that is already defined to be isomorphic to the binary blob and is not the upstream Haskell source. This issue is only about the deployment blob being a lossless encoding of its text format, and has nothing to say about the upstream translation from Haskell to the deployment language.

The proposal would give you the option of having the upstream translation include annotations in the deployment blob rather than always stripping them, and it would support a graceful increase in file size with the amount of annotation. If you choose to strip the annotations then it would be equivalent to the existing design. The proposal would allow the deployment source code to be viewed with the optional annotations - it has nothing to say about viewing the upstream Haskell source, but source maps might address that and be, as you describe, an optional link to the upstream source.

A source map between the deployment blob and the deployment text source code casts the deployment blob as not being a first-class source code, parallels disassembly and reverse engineering, does not degrade gracefully as the amount of annotation increases, and has its own usability and complexity downsides.

I don't think a Lisp text format would be popular, even though it would be familiar to me, and it is not necessary for the proposal here and thus out of scope for this issue. As an aside, encoding the AST in wasm does not seem to me to change the landscape wrt visual programming - it is already possible to parse text source code to create an AST for this purpose.

It is relatively trivial to show that we can losslessly translate between the encoded binary and its isomorphic text format when it is stripped of all annotations. It's very hard communicating the proposed solution, so I propose revisiting this issue when I can demonstrate it, which should not be too long after we have a binary and text format to play with, or it might be possible to demonstrate it with the current polyfill-prototype-1. |
@JSStats
The text and binary formats are both lossless encodings of the AST, so I don't see the problem here. |
On 12/05/2015 08:56 AM, Katelyn Gadd wrote:
Given the confusion I detect in the responses, would you and the chairs be prepared to state clearly to the group that they could have an encoded source code that is a lossless one-to-one encoding of the text source code without compromising the encoded size and parsing efficiency? The AST is not currently a lossless encoding of the text source code, and no consideration has been given to supporting annotations. The current design enforces the stripping of the annotations. This may well suit your uses and agenda, and that will still be supported. Please articulate clearly why we should all be forced to endure a stripped deployment blob just because you 'don't see the problem here'? |
@kg I think we've said we are not aiming to guarantee that. Textformat.md says
Other examples might be that someone might write code in the text format that has meaningful names for locals, labels, etc., while the binary format would have just indices for those things. |
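A rough sketch of that difference, assuming the s-expression style of the current prototypes (the function and its names are invented here purely for illustration):

```
;; Text format as an author might write it, with meaningful names:
(func $dist_sq (param $dx i32) (param $dy i32) (result i32)
  (i32.add
    (i32.mul (get_local $dx) (get_local $dx))
    (i32.mul (get_local $dy) (get_local $dy))))

;; The same function after round-tripping through a binary format that
;; keeps only indices: the structure survives, the names do not.
(func (param i32) (param i32) (result i32)
  (i32.add
    (i32.mul (get_local 0) (get_local 0))
    (i32.mul (get_local 1) (get_local 1))))
```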
@kg oops, I might have misread your statement as talking about the other direction of conversion. |
Shall explore augmenting the AST to make it a little more readable and writable and as good a source code as seems practical, using an optional source code metadata section, and shall do so elsewhere to avoid conflict. This group can ponder the merits later if it wishes, when re-visiting the formats. |
I would like to make the case to the members to consider changing the focus of this group from the development of a binary code to a source code with a binary encoding. The difference might not sound significant at first but it might make a significant difference to the intent of code deployed to the web in binary format.
In the current case, source code is 'compiled' or 'assembled' into the binary format and deployed in that binary format. With this focus the developers might be tempted to abandon any claim that the binary encoding is related to the source, and for example move to a linear virtual machine code without expressions or structured flow control etc.
While it might be possible to 'view-source' the deployed code, doing so might be considered 'disassembly' or 'reverse engineering', which are very loaded terms for IP.
I believe that although the operators being developed are primitive and close to the hardware, they can still be used in a structured source code with expressions and local variables etc. to make the code more readable and easier to write. A binary encoding would still be developed that would be a one-to-one reversible encoding of the source (basically a lossless compression of the source). I believe this could still serve well for the use case of a compilation target, which seems to be the current focus.
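As a loose illustration of that contrast (the syntax is only indicative of shape, not a settled design):

```
;; Structured, expression-based source: the operators stay primitive, but
;; the nesting and named locals read like ordinary source code.
(func $area (param $w i32) (param $h i32) (result i32)
  (i32.mul (get_local $w) (get_local $h)))

;; The same computation as a flat, linear virtual machine code, the style
;; this proposal argues against adopting as the only representation:
;;   get_local 0
;;   get_local 1
;;   i32.mul
;;   return
```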
I have been working away at trying to use type derivation to help eliminate bounds checking, and there has been another recent proposal by sunfish to use some loop analysis to help eliminate bounds checks too, and while I don't have anything concrete I suspect this will be much easier to define in structured code. For example, a common case is to define a local constant variable with a type that can be derived, such as by masking a value or asserting its bounds.
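A sketch of the masking case, assuming a table of at least 64 KiB (the syntax, the names, and the idea that an engine could derive a bounded type from the mask are all illustrative rather than an agreed design):

```
;; The 'and' with 0xFFFF bounds the index to [0, 65535]. If type
;; derivation could record that bound for $i, a byte load from a table
;; of at least 64 KiB would need no runtime bounds check.
(func $lookup (param $x i32) (param $table i32) (result i32)
  (local $i i32)
  (set_local $i (i32.and (get_local $x) (i32.const 0xFFFF)))
  (i32.load8_u (i32.add (get_local $table) (get_local $i))))
```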
The new name would remove 'Assembly' and make it clear that this is a source code, although a primitive one. For example, WebCore if it is not taken. The specification language would change its emphasis to being a source code, while still supporting the use case of being a compilation target.
Would there be any support for such a re-focusing of the group, or are the majority of people wanting a web machine code binary format to compile to?