-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Normative] Capture the HTML entity behaviors #136
Conversation
@RyanCavanaugh I think we talked about Babel's and TypeScript's handling of HTML entities a few months ago, but I cannot find where that discussion happened.
Do we know if there is any tool using HTML5 entities? As far as I know, the main ones use HTML4. It would be good to reduce implementation-definedness rather than to explicitly allowing it, to guarantee better code portability (this is something that ECMAScript itself tries to do as much as possible). |
@nicolo-ribaudo it was over at #132 |
91b2c12
to
b63b820
Compare
Given that one of the main concerns against TemplateLiteral support was the, shall we say, "conservative" worry that that non-breaking change would be tumultuous for this hyper-stable spec, I think it would be a mistake to kick off the documentation process by introducing a breaking change right out of the gate. If we can confirm that every current major existing implementation respects the same subset of HTML entities, as frustrating and inelegant as that might be, I think we should just responsibly document that and avoid the temptation of adding language that could create new future conflicts between implementations. For all its current faults, I think one thing that most users don't have to worry about is major differences in what their code means parser to parser. Now, If on the other hand we find that they already deviate, then it might create a valid excuse for trying to define some potentially "new" or "undefined" behavior since it can be argued that the behavior is already inconsistent. Even then though, I think it would be better to simply document each implementation in the spec as a proper description of the current reality. To some extent I think the right way to approach this process is like having to document "JSX quirks-mode" and less like documenting "JSX 1.0", since we find ourselves in a somewhat similar situation where we have so much existing code that uses the existing implementations. |
@nicolo-ribaudo @tolmasky I definitely share the same concern, but I was intended to open up this to encourage discussion so we can hear each other! I wanna share one (and probably THE only one) motivating factor that I had on expanding named character reference supports from HTML4 to HTML5: Because of how numeric (decimal or hex) character reference are basically just unicode code points, both Babel and TypeScript already support referencing HTML5 entities in its numeric format. Taking In another word, we only need to pay an extremely minor price of breaking super rare code (like having observed that it doesn't work and insist to write HTML5 named character reference in JSX??), to be able to complete the picture of having full support of HTML character reference (or entities), which, by itself, unfortunately, has a much much larger risk to be dropped completely. This sort of "completing HTML route" way of thinking is also how it may differ from the rationale against template literal which is more of "starting the JS route" way of thinking. |
Yes, this is the part that sucks about documenting existing behavior. The reality is we don't really know who is using these sequences. To give a plausible example, someone could have made a page saying "check out the cool new HTML entities in HTML5:" and listed
I also think that specifically mentioning HTML5 entities in the spec, even as "optional", may have the unintended consequence of putting existing implementations in an awkward position where they look like they are "choosing" the clearly worse option of only implementing HTML4 entities, when in reality they may just be stuck with it for legacy reasons. It may thus in effect simply kick the problem over to the implementors who now have to deal with both users who are upset that HTML5 entities don't work, or are upset that such a change was added, while the specification gets to be "correct" either way. |
How the tools deal with unrecognized entities? It seems they will become plain "&entity;" string which means add/remove entities are theoretically breaking change. |
Actually not. The page author needs to upgrade their babel/typescript version and re-compile the JSX to get the breaking behavior. It doesn't like making a breaking change on the Web standard itself. |
Sure, but that's true of all changes in libraries ;) (and I can't help but feel like the bar for "breaking change" was lower in that other thread...) We could remove HTML entities completely, switch angle brackets for square brackets, and require tag names to be capitalized, and it would still be the case that no one would experience any breaking behavior as long as they never upgraded anything. But I think the implication here is that this change could very well be consumed thinking it is a harmless bug-fix (especially since it is being inserted into the existing "version"), and have a change in behavior that would arguably be significant. Again, this is why I think that it would force Babel for example to implement this in the next semver-major version. Even then, it is quite annoying to have to call out this sort of subtle but significant behavior change in upgrade notes. I could definitely see some teams making the decision that this is a |
I looked at how everyone does its thing, with: const x = <y z="alpha & bravo charlie ©cat; delta echo © foxtrot ≠ golf 𝌆 hotel india ⊢ juliett kilo &lima; mike & november">
alpha & bravo
charlie ©cat; delta
echo © foxtrot ≠ golf 𝌆 hotel
india ⊢ juliett
kilo &lima; mike & november
</y>
// Note:
// - the above uses the same value in an attribute and as text, so that the algorithm can be observed in both places.
// - `&` without `;` turns into `&` by the HTML algorithm, not by XML
// - `©` turns into `©` in HTML
// - echo..hotel are all normal reference
// - `⊢` is an “HTML 5” named character reference, it’s not in HTML 4.
// - `&lima;` is unknown
// - `&` is just a & according to HTML
esbuild: $ npx esbuild --loader=jsx <<< 'const x = <y z="alpha & bravo charlie ©cat; delta echo © foxtrot ≠ golf 𝌆 hotel india ⊢ juliett kilo &lima; mike & november">
alpha & bravo
charlie ©cat; delta
echo © foxtrot ≠ golf 𝌆 hotel
india ⊢ juliett
kilo &lima; mike & november
</y>'
To recap, that means that everybody works exactly the same. No “HTML 5” character references. No “HTML” algorithm. No errors. |
HTMLNamedCharacterReference | ||
|
||
HTMLDecimalCharacterReference:: | ||
`&` `#` DecimalDigits [> but only if MV of |DecimalDigits| ≦ 0x10FFFF] `;` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 
allowed?
CommonMark limits this at 1-7
DecimalDigits
: https://spec.commonmark.org/0.30/#decimal-numeric-character-references
`&` `#` DecimalDigits [> but only if MV of |DecimalDigits| ≦ 0x10FFFF] `;` | ||
|
||
HTMLHexCharacterReference:: | ||
`&` `#` `x` HexDigits [> but only if MV of |HexDigits| ≦ 0x10FFFF] `;` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Is
 
allowed?
CommonMark limits this at1-6
HexDigits
: https://spec.commonmark.org/0.30/#hexadecimal-numeric-character-references - Is an uppercase
X
allowed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good calls
Is   allowed?
It seems like Babel and TS diverged here. Babel capped it to 6 HexDigit
while TS allowed arbitrary HexDigits
. I think this is an artifact of Babel handling it in parser and TS handle it in a post transform. cc @nicolo-ribaudo @RyanCavanaugh for their thoughts on this.
Is an uppercase X allowed?
I only tried TS and Babel and they both don't and therefore it's excluded from here. Let me know if you've seen any other implementation supporting this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at our implementation, Babel expects entities to have at most 10 characters after &
. However, I can't find any reason for that limit other than "it was in the initial acorn-jsx
" implementation:
https://github.com/acornjs/acorn-jsx/blob/802c4cd8cb41692582b76ae2dda4890449c6853f/acorn.js#L1485
I would slightly prefer to allow what HTML allows (I don't know if it matches Babel or TS), but I don't have a strong opinion on that.
EDIT: HTML doesn't have any length limit.
|
||
<p>The exact approach to fulfill the extended sematics is <a href="https://tc39.es/ecma262/#implementation-defined">implementation-defined</a>.</p> | ||
<emu-note> | ||
For example, at the time of writing, Babel implemented this in its <a href="https://github.com/babel/babel/blob/main/packages/babel-parser/src/plugins/jsx/index.js">JSX parser</a>, while TypeScript implemented it via a <a href="https://github.com/microsoft/TypeScript/blob/main/src/compiler/transformers/jsx.ts">JSX transformer</a>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably include how whitespace is handled (which implementations disagree about)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I intentionally omit describing whitespace in this PR because I'm aware of the divergence (from previous conversations), which means we probably need some alignments which is better suited by another PRs dedicated for that.
Thanks for reminding!
Representing the TypeScript implementation side of this, I'll just echo this from earlier in the thread:
Cross-tool compatibility really should be the highest-order bit here. If your app renders properly in esbuild, you should be able to switch to Babel, or switch from tsc to SWC without worrying that users will start seeing unrendered Even if tools add configuration settings to control which entity set is used, it's not immediately obvious which entities are HTML4, HTML5, or even newer, so you might not realize until you get an embarrassing bug report that you had that configuration wrong in the first place. |
Let's be faithful to the de-facto and capture the HTML semantics to the spec. I haven't seen any practices specifying transipilers via ECMA-262 so this was a bit challenging may seems foreign. I ended up extending `Static Semantics: SV` to make it concise and (hopefully) accurate enough. I think this is a cool hack ;) - Introduced `JSXStringCharacter` and `JSXString` - Fix facebook#133 by using `::` to make problematic grammars lexcial I intentionally making the set of supported HTML entities implementation-defined to allow either HTML4 or HTML5 set but this is open to discuss.
b63b820
to
88c892c
Compare
88c892c
to
118ed21
Compare
Thanks everyone for making inputs. Since we've reached a clear consensus I've updated this PR to only allow HTML4 named entities. I guess the only remaining question is #136 (comment) (i.e. is numeral entity allowed to be greater than 6 hex digits) |
I'd recommend going for the strictest restriction: at the parser. Say that they can only be 6 or 7 characters. Because they can. Instead of 100. Which they can't. |
JSXTextCharacter JSXText? | ||
|
||
JSXTextCharacter : | ||
SourceCharacter but not one of `{` or `<` or `>` or `}` | ||
JSXTextCharacter :: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a follow-up PR, it might be good to separate ::
and :
grammars in separate sections, like ecma262 does. A single :
difference is hard to notice, especially when they are all mixed together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've been thinking of the same. Thanks for calling that out!
`&` `#` DecimalDigits [> but only if MV of |DecimalDigits| ≦ 0x10FFFF] `;` | ||
|
||
HTMLHexCharacterReference:: | ||
`&` `#` `x` HexDigits [> but only if MV of |HexDigits| ≦ 0x10FFFF] `;` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at our implementation, Babel expects entities to have at most 10 characters after &
. However, I can't find any reason for that limit other than "it was in the initial acorn-jsx
" implementation:
https://github.com/acornjs/acorn-jsx/blob/802c4cd8cb41692582b76ae2dda4890449c6853f/acorn.js#L1485
I would slightly prefer to allow what HTML allows (I don't know if it matches Babel or TS), but I don't have a strong opinion on that.
EDIT: HTML doesn't have any length limit.
🤔 I doubt allowing unlimited length of hex digits here will hurt parser performance. This should be as trivial as lexing a numeric literal which doesn't involve any ambiguity and lookahead. Since HTML doesn't limit this and TS seems to already does this, and since @nicolo-ribaudo has expressed willingness to make change from the Babel's side, I would propose that we keep this unlimited. |
Size in browsers: TL;DR; browsers support giant character references: code = '&'.charCodeAt(0).toString(16)
checks = [2**4, 2**8, 2**16, 2**24];
checks.forEach(d => {
document.body.innerHTML = '&#x' + code.padStart(d, '0') + ';';
console.log(document.body.innerHTML)
})
(tested in chrome, firefox, safari) To test long character references in JSX implementations, I used: <>
<div id="&" />
<div id="&" />
<div id="&" />
</>
Babel: 💥 TypeScript: Does support long character references SWC: 💥 esbuild: npx esbuild --loader=jsx <<< '<>
<div id="&" />
<div id="&" />
<div id="&" />
</>'
Yields: /* @__PURE__ */ React.createElement(React.Fragment, null, /* @__PURE__ */ React.createElement("div", {
id: "&"
}), /* @__PURE__ */ React.createElement("div", {
id: "&"
}), /* @__PURE__ */ React.createElement("div", {
id: "&"
})); TL;DR: 50%/50% I also checked lower ( const x = <>
<y id="&" />
<y id="&" />
</> Babel, TypeScript, SWC: No esbuild: npx esbuild --loader=jsx <<< 'const x = <>
<y id="&" />
<y id="&" />
</>' const x = /* @__PURE__ */ React.createElement(React.Fragment, null, /* @__PURE__ */ React.createElement("y", {
id: "&"
}), /* @__PURE__ */ React.createElement("y", {
id: "&"
})); (no TL;DR: Nobody supports document.body.innerHTML = '&'; document.body.innerHTML // "&" My opinion:
|
|
Speaking as an implementor, if the JSX spec didn't specify a limit, but we capped it at something reasonable like 16 for whatever reason, I'd be super OK with that happening. Spec deviations for implementation performance (or even security) reasons are extremely common in practice, e.g. https://stackoverflow.com/questions/4582012/maximum-number-of-parameters-in-function-declaration |
I wanna echo @RyanCavanaugh that specifications tend to be more theoretical and more generalized than implementations. A turning machine has unlimited memory in theory which we all know it's not possible. In this particular case, a implementation may use But I do wanna thank @wooorm for this detailed investigation and call out this potential concern ❤️ In summary, I think we are good to merge this. I'll file an issue to Babel regarding this change. Please feel free to file issue to other implementations like MDX. Thanks for everyone involved in the discussion! |
Summary
Let's be faithful to the de-facto and document the HTML entity behaviors to the spec. Note that this is not about whether we should "drop this semantics or not", but about documenting the current behaviors that everyone has been living with for years.
The Proposed Normative Change
I'm not aware of any practices specifying such transpiler/transform semantics in ECMA-262 so this is a really interesting attempt 🙂 So I ended up extending
Static Semantics: SV
which is the smartest way I can find to hack the semantics into the ECMA-262 spec. I think this should work and should be accurate enough. I'm curious on how implementors think about it though.I also intentionally left the set of supported HTML entities implementation-defined to allow either HTML4 or HTML5 set. This may be seen as a breaking change in some regard and this is open to discuss here.We've reached consensus that only HTML4 entities are allowed.This commit also close #133 by using
::
for characters which are supposed to be lexical grammars.Close #126
Close #4
Test Plan
open
index.html
and proof-read the spec ;)Preview at my fork https://huxpro.github.io/jsx