Recognize that the binary format can be a lossless and one-to-one encoding of the textual format without compromising the use case of being a target for source-to-source compilers. #501

Closed
wants to merge 1 commit into from

Conversation

ghost

@ghost ghost commented Dec 13, 2015

Recognizing this technical point seems a first step in addressing the use case of supporting the deployment of source code with annotations. If this technical point is invalid then the reasons might make the case for requiring all deployment binaries to be stripped. Whereas if it stands then we can move on to consider the implications. This question appears to be the resource efficient 'fail early' test developed from discussions in #483

@sunfishcode
Member

TextFormat.md does describe a debug symbol section. However, there is much more to being fully lossless.

I recognize the technical point that it is technically possible to design a system which achieves a lossless binary encoding of a text format in a relatively efficient manner, but it would take time and effort, and add clutter if not also complexity to the spec and implementations, and it's not clear what it would achieve. The WebAssembly language in its current form is not well-suited for writing by hand in significant quantities, for much bigger reasons than the lossiness of encoding of certain details of the text format. This is because WebAssembly is being designed to be a compilation target.

@ghost
Author

ghost commented Dec 13, 2015

@sunfishcode Thank you for recognizing the technical point, much appreciated. I would be keen to land a note on this, perhaps along with the reservations somewhere.

I don't think it would burden implementations, as it could likely be designed as sections, opcodes, etc. that could be ignored for the purpose of the program semantics or even for viewing the source. Could a note to this effect also be recognized?

If the spec clutter and complexity became a real concern then the details could be specified separately and the main spec could just note reserved and ignored sections and opcodes.

It remains to be seen how much it would burden the spec accommodating it. Some gains could be made very easily now, such as opaque text blobs between functions and sections.

'What it would achieve' might be better discussed in other issues, otherwise this would descend into issue 483. Recognising and demonstrating the possibility, and giving people something to play with, might help them get their mind around the deployed format also being capable of being annotated wasm source code, and recognize some advantages to them.

@sunfishcode
Member

README.md is an introductory document. The text proposed to be added to it here doesn't introduce anything discussed anywhere else in the project, so as it stands it seems like it would only increase confusion.

Also, it uses several terms in what appear to be loaded ways. It says "lossless" but then only talks about comments and label names. It says "source-to-source compilers" in what reads to me as a recasting of WebAssembly's high-level goals in a specific way which isn't explained anywhere. I don't know what "parts of the text source that do validate" means.

We do already say that unrecognized sections are ignored, so one path forward here would be to create a new independent project to create a spec for a section containing additional information.
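
To illustrate how a decoder can ignore unrecognized sections, here is a minimal sketch. The container layout (a one-byte name length, the name, a four-byte little-endian payload size, then the payload) and the section names `code` and `annotations` are assumptions made purely for illustration; they are not the WebAssembly binary format, which is still undefined at this point in the discussion.

```python
import struct

# Hypothetical: the only section this toy decoder understands.
KNOWN_SECTIONS = {"code"}


def encode_section(name: str, payload: bytes) -> bytes:
    """Encode one section as: name length, name, payload size, payload."""
    raw = name.encode("utf-8")
    return bytes([len(raw)]) + raw + struct.pack("<I", len(payload)) + payload


def decode_known_sections(data: bytes) -> dict:
    """Return the payloads of recognized sections, skipping all others."""
    sections = {}
    pos = 0
    while pos < len(data):
        name_len = data[pos]
        pos += 1
        name = data[pos:pos + name_len].decode("utf-8")
        pos += name_len
        (size,) = struct.unpack_from("<I", data, pos)
        pos += 4
        if name in KNOWN_SECTIONS:
            sections[name] = data[pos:pos + size]
        pos += size  # advance past the payload whether or not it was recognized
    return sections


module = encode_section("code", b"\x01\x02") + encode_section(
    "annotations", b";; license header"
)
decoded = decode_known_sections(module)
```

Because every section carries its own size, the decoder can step over the `annotations` payload without understanding it, which is what would let a separately specified annotation section coexist with existing consumers.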

@kripken
Member

kripken commented Dec 13, 2015

I am personally opposed to this for the reasons I said in the previous discussion. In my opinion, a 1-to-1 mapping of the binary and text formats is potentially dangerous for the quality of the text format, and the quality of the view-source experience on the web:

  • Locking down a 1-to-1 mapping means that when we make decisions on the binary format for compactness and efficiency, those decisions have direct implications for the text format.
  • What is optimal for the binary format might or might not be optimal for the user experience of view-source on the web.
  • We don't yet know what we want view-source to look like (I think we should focus on this a lot more, as I've been saying for some time). Adding more constraints to the text format at this stage therefore is risky.

@jfbastien
Member

I don't understand what the technical advantage to having lossless and 1-to-1 source/binary encoding is. I agree with @sunfishcode and @kripken's points, and see downsides more than upsides.

Two non-technical considerations against having this property:

  • Making the textual representation something that's expected to be edited frequently has negative implications with most licenses including Apache 2.0 (even with a runtime exception). Specifically, it's desirable to avoid the ramifications of this statement:

    "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

  • Our community group's and the Web's working expectation is that view-source is always supported. We want to clarify the action to be conventional and commonly done. We explicitly encourage view-source support since we view text and binary as equivalent. The WebAssembly text and binary formats aren't the original sources, rather they are compilation targets.

I'm in the process of refining our stated goals as well as our licensing to resolve a few issues, and the above two points are part of what I'm trying to address. I'm still talking to lawyers: the words I'm using above have very specific meaning and legal implications which I think you'll find desirable. I'll provide more details once I have something finalized, but we're unfortunately not able to dive into all of the details (you should consult with your own lawyer for this).

I think the issue @JSStats raises touches part of what needs to be fixed. I'm simply not convinced that his proposed solutions lead to what he wants!

@ghost
Author

ghost commented Dec 13, 2015

@kripken I was hoping to break it down and just recognize the technical possibility here. I will take a look at addressing all the feedback above and rework the patch.

I agree with your concerns, that just accommodating this would add some burden to both the binary and text formats, but that seems to be a matter yet to be determined, and I can't think of any matter that cannot be easily resolved.

The specification already constrains the binary and text formats to be isomorphic, and also requires that it be possible to 'pretty-print' the binary encoded source code into the text format, which, to be isomorphic, needs to re-encode into semantically identical code. It would also seem fair to assume that this transform could be done incrementally in one pass, further constraining both anyway. I believe there will be other practical constraints too, arising from pipeline complexity and the need to be able to stream the code through a transform pipeline, that will also demand the text and binary formats be close. So the residual burden from this proposal might not be too significant.

Perhaps you would like a higher level 'Assembly' code, but this will not be the product of 'view-source' anyway so there seems a good argument that this be moved upstream. Source maps can handle this case.

I would be happy to note this 'potentially dangerous' fear, but it does not seem to be a practical technical matter on which decisions can be made yet. If you have something specific in mind then I would be happy to consider it?

This will all need to be locked down for the MVP, so I hope in a short time frame anyway, and there will be a lot of creative work to do over the next few months sorting out the binary format and the text format, so your concerns on the burden will be exposed into concrete technical matters soon.

@kripken
Member

kripken commented Dec 13, 2015

Yes, I can't separate out my concerns into concrete issues, but that's part of the problem - we are not even close to having a clear direction for a text format, and without that, can't evaluate how much risk your proposal would add. Any additional constraint, on top of all the ones we already have, is more risk.

If we want view-source on the web to be a good user experience, it's my strong belief that it can't be something we tack on or cobble together at the end of the wasm design process; it can't just reflect 1:1 whatever we find is optimal for the binary format - the two formats have very different considerations.

@ghost
Author

ghost commented Dec 14, 2015

On 12/14/2015 08:09 AM, JF Bastien wrote:

I don't understand what the technical advantage to having lossless and
1-to-1 source/binary encoding is.

The advantages might not be technical. They might be subjective. They might relate to principles of extensibility of web formats, etc. Thus I did hope to focus this issue on only the technical possibility and that it would not technically compromise the 'compilation-target' use case. But I am always happy to explore the topic.

Two non-technical considerations /against/ having this property:

  • Making the textual representation something that's expected to be
    edited frequently has negative implications with most licenses
    including Apache 2.0 (even with a runtime exception). Specifically,
    it's desirable to avoid the ramifications of this statement:

    "Source" form shall mean the preferred form for making
    modifications, including but not limited to software source
    code, documentation source, and configuration files.
    

For the source-to-source compilation use case the deployment source code would not be the 'preferred form for making modifications' for the purpose of this license. At least with this proposal people could include a license as a file header comment.

I do believe that the US courts have already set precedents that even machine code is 'source code' because trained people might read and write and share information in this form. There were analogies to music notation and graphics etc. So even the currently specified binary deployment file could be legally considered source code for some matters, but might not be considered the 'source' for the definitions in some licenses. Please let us know if you receive some other legal insights. Here's one link I found quickly: https://www.eff.org/cases/bernstein-v-us-dept-justice The DeCSS cases were interesting too, but failed for other reasons.

Perhaps you could ask if requiring all comments be stripped (including licenses) could be an even bigger show stopper for a lot of licenses? People will 'view-the-source' without the license, so software licenses that demand the license and disclaimers etc remain could have big problems with stripped source code!

  • Our community group's and the Web's working expectation is that
    view-source is always supported. We want to clarify the action to be
    conventional and commonly done. We explicitly encourage view-source
    support since we view text and binary as equivalent. The WebAssembly
    text and binary formats aren't the original sources, rather they are
    compilation targets.

But the deployment format could be the original source for some users, and it is still a matter for the author to define 'original source', not for us. I worry that what you are effectively doing is restricting the use cases for this new web format, and this seems to go against the principles of openness and extensibility for the web. Even if this group did not want to support source code annotations, other groups should be able to extend this web format to meet this use case unless a compelling technical reason can be demonstrated, and I don't think there is one.

I'm in the process of refining our stated goals as well as our licensing
to resolve a few issues, and the above two points are part of what I'm
trying to address. I'm still talking to lawyers: the words I'm using
above have very specific meaning and legal implications which I think
you'll find desirable. I'll provide more details once I have
something finalized, but we're unfortunately not able to dive into all
of the details (you should consult with your own lawyer for this).

The problem with this group accepting any advice you receive is that lawyers push the positions they are paid to push, so we can't take your private advice into consideration. No offence intended to lawyers. Courts set public precedents, and if you can mention some for us to consider then this might help. I suggest all we can practically do is take prudent steps, and removing the distinction between the binary encoding and the text encoding seems to remove complexity on some of these topics. It might create a little more technical complexity, but we are good with technical complexity and it is much easier to reach consensus on such matters.

@ghost
Author

ghost commented Dec 14, 2015

On 12/14/2015 02:39 AM, Dan Gohman wrote:

README.md is an introductory document. The text proposed to be added to
it here doesn't introduce anything discussed anywhere else in the
project, so as it stands it seems like it would only increase confusion.

Moved to the Rationale.

Also, it uses several terms in what appear to be loaded ways. It says
"lossless" but then only talks about comments and label names.

Good point, there is no need to introduce new rhetoric here so switched to use the existing wording in the spec.: 'functions, locals, globals, etc,'.

The use of 'lossless' is not intended to mean that the text format needs to support a boundless range of extensions, just that what is supported can be encoded without loss. A note on how this could already apply to the pretty-printed source has been added.

It says "source-to-source compilers" in what reads to me as a recasting of
WebAssembly's high-level goals in a specific way which isn't explained
anywhere.

Again to avoid introducing new rhetoric here it now references the high level goal: '... the current high level goal to define a size and load-time-efficient binary format to serve as a compilation target'.

I don't know what "parts of the text source that do validate" means.

For example, if the text source code is simply not syntactically valid it could still be encoded in an opaque text blob.

With the current plan I presume this would simply fail to 'Assemble' into the binary format.

@ghost
Author

ghost commented Dec 14, 2015

@kripken I think we are on the same page, both wanting a good view-source user experience.

The wording has been softened to: 'It is recognized as technically plausible for the purpose of ongoing discussions ...'.

Also the uncertainty you note is mentioned: 'This recognition is qualified with the reservation that the binary format and textual format are yet to be defined so the technical burden is not yet certain.'

If I understand your other point, then might the following addition address it: 'A lossless one-to-one encoding is expected to help make the code readable and writable but is not sufficient to ensure these properties.'

@jfbastien
Member

The advantages might not be technical. They might be subjective. They might relate to principles of extensibility of web formats, etc. Thus I did hope to focus this issue on only the technical possibility and that it would not technically compromise the 'compilation-target' use case. But I am always happy to explore the topic.

I still don't understand what the upsides of your proposal are, technical or otherwise. Could you please explain?

For the source-to-source compilation use case the deployment source code would not be the 'preferred form for making modifications' for the purpose of this license. At least with this proposal people could include a license as a file header comment.

I do believe that the US courts have already set precedents that even machine code is 'source code' because trained people might read and write and share information in this form. There were analogies to music notation and graphics etc. So even the currently specified binary deployment file could be legally considered source code for some matters, but might not be considered the 'source' for the definitions in some licenses. Please let us know if you receive some other legal insights. Here's one link I found quickly: https://www.eff.org/cases/bernstein-v-us-dept-justice The DeCSS cases were interesting too, but failed for other reasons.

Perhaps you could ask if requiring all comments be stripped (including licenses) could be an even bigger show stopper for a lot of licenses? People will 'view-the-source' without the license, so software licenses that demand the license and disclaimers etc remain could have big problems with stripped source code!

I'd like all of us to avoid interpreting the law when none of us are lawyers. My experience is that engineers, self included, aren't qualified to do so. Regardless of what we conclude we'll end up being shown how wrong we were when we do consult a lawyer!

But the deployment format could be the original source for some users, and it is still a matter for the author to define 'original source', not for us. I worry that what you are effectively doing is restricting the use cases for this new web format, and this seems to go against the principles of openness and extensibility for the web. Even if this group did not want to support source code annotations, other groups should be able to extend this web format to meet this use case unless a compelling technical reason can be demonstrated, and I don't think there is one.

The format not being original source is one of the main upsides. I don't see how this goes against openness or extensibility: the format is still entirely documented and rigorously specified. You can extend the format with annotations all you want, since the optional sections are entirely ignored for such purposes (same as debugging).

The problem with this group accepting any advice you receive is that lawyers push the positions they are paid to push, so we can't take your private advice into consideration. No offence intended to lawyers. Courts set public precedents, and if you can mention some for us to consider then this might help. I suggest all we can practically do is take prudent steps, and removing the distinction between the binary encoding and the text encoding seems to remove complexity on some of these topics. It might create a little more technical complexity, but we are good with technical complexity and it is much easier to reach consensus on such matters.

Again, you're questioning my and my coworker's professionalism, and the honesty of our intents. I take offence to this. If you wish people to continue engaging with you then I highly suggest you stop doing this. Getting paid to work on WebAssembly doesn't taint the intent browser folks have as fosters of the web. In fact professionals such as lawyers have further standards to adhere to in these circumstances.

@ghost
Author

ghost commented Dec 14, 2015

How about we adjourn for the holidays? Could someone from Google responsible for Mr Bastien please contact me privately.

@lukewagner
Member

I definitely agree with the top-level goal of providing a good view-source experience. I don't see the benefits of being able to provide 100% lossless encoding of arbitrarily-formatted text and I do see this as having a non-trivial specification/implementation burden, so I'll agree with @kripken, @sunfishcode and @jfbastien about not wanting to include this in the design docs.

However, I do think we can define a "canonical" text format which can be losslessly encoded (when the optional symbol section is included). In particular, I was realizing over the last week, in the context of discussing wasm with our devtools folks, that our spec-defined binary->text projection should include white-spacing and line breaking so that line/column numbers were portable between different tools/browsers. So basically the canonical format would be defined as text for which text = to_text (to_binary text).
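
The fixed-point definition `text = to_text (to_binary text)` can be made concrete with a small sketch. Everything here is a toy assumption: the three-instruction opcode table, the one-byte immediates, and the one-instruction-per-line layout are invented for illustration and are not a proposed encoding.

```python
# Hypothetical opcode table for a toy instruction set.
OPCODES = {"nop": 0, "i32.const": 1, "i32.add": 2}
NAMES = {code: name for name, code in OPCODES.items()}
HAS_IMMEDIATE = {"i32.const"}  # instructions followed by one operand byte


def to_binary(text: str) -> bytes:
    """Assemble text to bytes; whitespace and blank lines are not preserved."""
    out = bytearray()
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # blank lines carry no semantics
        op = tokens[0]
        out.append(OPCODES[op])
        if op in HAS_IMMEDIATE:
            out.append(int(tokens[1]))  # toy: a single unsigned byte
    return bytes(out)


def to_text(binary: bytes) -> str:
    """Project bytes to text using one fixed whitespace and line-break policy."""
    lines = []
    pos = 0
    while pos < len(binary):
        op = NAMES[binary[pos]]
        pos += 1
        if op in HAS_IMMEDIATE:
            lines.append(f"{op} {binary[pos]}")
            pos += 1
        else:
            lines.append(op)
    return "".join(line + "\n" for line in lines)


def is_canonical(text: str) -> bool:
    """A text is canonical exactly when the round trip reproduces it."""
    return to_text(to_binary(text)) == text


canonical = "i32.const 1\ni32.const 2\ni32.add\n"
messy = "  i32.const   1\n\ni32.const 2\ni32.add"
```

In this sketch `messy` and `canonical` assemble to the same bytes, but only `canonical` survives the round trip unchanged, so line and column numbers computed from `to_text` output would agree across any tools that share the same projection.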

Lastly, in the farther future, Tooling.md includes the goal of providing a source-maps-like experience. This would likely involve embedding copies of the source which, as a corner case, would allow 100% lossless preservation of source wasm text.

@ghost
Author

ghost commented Dec 14, 2015

On 12/15/2015 04:41 AM, Luke Wagner wrote:

I definitely agree with the top-level goal of providing a good
view-source experience. I don't see the benefits of being able to
provide 100% lossless encoding of arbitrarily-formatted text

This PR is focused on recognizing a technical matter, and recognizing no substantive technical impacts. Landing this is part of a larger discussion, so opinions on the 'benefits' (even technical) are out of scope, and given the difficulty in these discussions can I ask that we stay focused here. I am happy to continue the discussion once this matter is settled.

and I do
see this as having a non-trivial specification/implementation burden.

The 'implementation burden' is small as it need only ignore some sections or opcodes. Would you concede this point?

Further as discussed above the specification burden could be mitigated by separating it. The patch does recognize the unknown burden on the specification in just accommodating this. But as argued above it is a residual burden because a canonical text format is already planned. Would you concede that the patch has qualified this adequately?

However, I do think we can define a "canonical" text format which /can/
be losslessly encoded (when the optional symbol section is included). In
particular, I was realizing over the last week, in the context of
discussing wasm with our devtools folks, that our spec-defined
binary->text projection should include white-spacing and line breaking
so that line/column numbers were portable between different
tools/browsers. So basically the canonical format would be defined as
text for which |text = to_text (to_binary text)|.

Great, this is basically the claim being requested to be recognized here, and I would like to land recognition of this technical point to help make progress with development discussions.

It is an important technical point in a very frustrated discussion that might impact the high level goals and focus of this project. Based on discussions I expect many members to be surprised that this is technically possible. Not all members of the group might appreciate this technical point, yet want to have input into the more subjective matters, so I would like to separate them.

Will you also concede that this canonical text format is already spec'ed to include some annotations via the 'debug info' support already planned? Will you concede that there need be no additional burden for these?

I don't believe it would be a significant burden to extend this canonical text format somewhat to support text comments, probably much simpler than the existing planned 'debug' support or the planned 'source maps'.

Will you concede that supporting text comments between sections and functions is trivial?

Will you concede that if only canonical text were considered as valid then non-canonical text could be easily supported as opaque text blobs?

I would like to push what we can agree on so that we can take a look at the agreed state, which I expect would be a very useful one-to-one lossless text source code format. I would like to see what can be agreed on so that the story options can be put to the group members to help move forward with the larger discussions.

I am happy to accept that some of the points are not yet conceded, and it does not seem unreasonable to expect them to be demonstrated, but it already sounds like we can agree on some matters.

Are you making a strong statement that it is non-plausible and thus should be disregarded in ongoing discussions?

If some of the disagreement comes down to subjective matters then I would like to identify these to remove them from this PR to keep the focus on technical matters.

Lastly, in the farther future, Tooling.md includes the goal of providing
a source-maps-like experience. This would likely involve embedding
copies of the source which, as a corner case, would allow 100% lossless
preservation of source wasm text.

Attaching the entire source code file would be one possible solution. However this solution fails to degrade gracefully as the level of annotations increases, and the compressed size would most likely be very uncompetitive. For example, just to add a comment at the header of the file would demand attaching the entire text source code file. It appears practical to do much better with finer grain support.

@jfbastien
Member

This PR is focused on recognizing a technical matter, and recognizing no substantive technical impacts. Landing this is part of a larger discussion, so opinions on the 'benefits' (even technical) are out of scope, and given the difficulty in these discussions can I ask that we stay focused here.

Are you saying that the merit of this PR is tied to resolving another discussion?

  • If so, we should resolve that other discussion first. Which discussion is this?
  • If not then we need to discuss the technical and non-technical merits of this PR before committing. Could you please clarify the merits you perceive?

@ghost
Author

ghost commented Dec 15, 2015

@jfbastien This PR speaks only to a technical matter on an option, making no decision about adopting the option, thus any discussion about adopting the option is out of scope for this PR unless there are concerns that the wording does imply a decision in which case please speak up so it can be reworded. Documenting the technical merits of options may be useful input into other decisions. It's getting close to the holidays, and perhaps we could all take a break from public discussion on this topic, and pick it up next? Take care.

@ghost
Author

ghost commented Dec 15, 2015

@lukewagner Another important point I forgot to mention in relation to the solution of attaching a text source file and using source maps: in this case there is no guarantee of consistency between the text source semantics and the parsed encoded semantics. It would be a big burden for the consumer to be expected to parse the attached text source to validate that it was consistent with the parsed encoded semantics. In contrast if we just encode non-semantic differences between non-annotated canonical text source and the annotated text source then we are assured of semantic consistency which is a nice property. Have a good holiday everyone.
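
The idea of encoding only the non-semantic differences can be sketched as follows. The `;;` comment syntax, the side table keyed by code-line index, and the function names are all hypothetical choices for this illustration, not anything specified by the project.

```python
def split_annotations(annotated: str):
    """Separate ';;' comment lines (toy syntax) from the code lines."""
    code_lines = []
    comments = {}  # index of the code line that follows -> comment lines
    for line in annotated.splitlines():
        if line.lstrip().startswith(";;"):
            comments.setdefault(len(code_lines), []).append(line)
        else:
            code_lines.append(line)
    return code_lines, comments


def merge_annotations(code_lines, comments):
    """Rebuild the annotated text from code lines plus the comment side table."""
    out = []
    for index, line in enumerate(code_lines):
        out.extend(comments.get(index, []))
        out.append(line)
    out.extend(comments.get(len(code_lines), []))  # comments after the last line
    return "\n".join(out) + "\n"


annotated = (
    ";; license header goes here\n"
    "i32.const 1\n"
    ";; add the operands\n"
    "i32.add\n"
)
code_lines, comments = split_annotations(annotated)
rebuilt = merge_annotations(code_lines, comments)
```

Since the annotated view is always rebuilt from the code lines that the decoder itself produced, the comments cannot drift out of step with the encoded semantics, unlike an attached copy of the full text source, which a consumer would have to re-parse to verify.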

…oding of the textual format without compromising the high level goal to define a size and load-time-efficient binary format to serve as a compilation target.

@jfbastien
Member

@jfbastien This PR speaks only to a technical matter on an option, making no decision about adopting the option, thus any discussion about adopting the option is out of scope for this PR unless there are concerns that the wording does imply a decision in which case please speak up so it can be reworded. Documenting the technical merits of options may be useful input into other decisions.

We add things to the design repo based on the merit of the idea. Yes I recognize that we can make the text ↔ binary transition lossless, but that doesn't mean that we should do it. There may be technical and non-technical reasons to do it, and it's based on these that we should edit the design repo. If we don't have good reasons then it's an idea that's up for discussion, and github issues are ideally suited to mark ideas we want to revisit later.

Put another way: PRs aren't the right medium for ideas which seem plausible but aren't sound yet. The design repo is meant to express what we think WebAssembly is today. Issues are used to track what's left to do. Committing an idea which isn't even partly finished to the design repo is misleading to the reader.

It's getting close to the holidays, and perhaps we could all take a break from public discussion on this topic, and pick it up next? Take care.

I'm still at work and likely won't take much of a break: I already took a long vacation a few weeks ago.

@ghost
Author

ghost commented Dec 16, 2015

One implementation and specification burden that does not seem to have been articulated yet is that it would be necessary to be able to pretty-print all the content of the binary format, otherwise there would be a loss of information. This would conflict with there being sections that could be ignored, which has been suggested as a mechanism to allow future extensions to have some backwards compatibility. Such extensions would be lost by a decoder without support for them, which would be a serious issue for sharing code in this binary format and perhaps a show stopper! Perhaps I was blind to this issue because in Lisp the code is represented by data, the pretty-printer operates on the data layer, and extensions build on the data layer. I still believe a good discussion of the issues in this area does belong in the rationale.
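
A small sketch of that loss, under invented assumptions: a module is modelled as a list of `(name, payload)` sections, the tool only knows the `code` section, and the parenthesised text syntax is made up for this example.

```python
# Hypothetical: the only section this older tool can pretty-print.
KNOWN = {"code"}


def pretty_print(sections):
    """Render known sections as text; unknown sections are silently ignored."""
    return [
        f"({name} {payload.hex()})"
        for name, payload in sections
        if name in KNOWN
    ]


def assemble(lines):
    """Parse the toy text form back into (name, payload) sections."""
    sections = []
    for line in lines:
        name, hex_payload = line.strip("()").split(" ")
        sections.append((name, bytes.fromhex(hex_payload)))
    return sections


original = [("code", b"\x01\x02"), ("future-ext", b"\xff")]
round_tripped = assemble(pretty_print(original))
```

Here `round_tripped` no longer contains the `future-ext` section, so a binary that passes through this tool as text and back is not the binary that went in, which is the conflict between ignorable sections and a lossless text form described above.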

@ghost ghost closed this Dec 16, 2015
This pull request was closed.

4 participants