Normative: Update Unicode property lists per Unicode v13 #1896
(Although "has tests" usually means the tests are merged, so we'll wait until they are before merging this.)
Actually, can this update Annex E as well?
You no longer need to wait for tests: tc39/test262#2526
I defer to the other editors here. I don't feel comfortable with Unicode.
Force-pushed: 2725eb3 to d63241e
Actually, before I land this: @mathiasbynens, would you please update Annex E to describe the observable changes in v13?
Force-pushed: d63241e to 97e2e48
@ljharb Added a note. PTAL.
@michaelficarra would you or @mathiasbynens mind adding an item to the agenda to discuss it?
Added: tc39/agendas@d15454b. @mathiasbynens, if you'd like to co-present or create supporting materials, feel free to add yourself.
Force-pushed: 14d4a76 to e1c7179
I'm curious what you're planning to present/propose. AFAICT, we need to refer to some list of Unicode properties (and values, and aliases for both) that must be supported in ECMAScript regular expressions, regardless of where that list is maintained. The upstream Unicode Standard does not currently have such a list, and it seems unlikely it ever will given it already publishes property lists that are different from what we explicitly decided to support in ECMAScript. Feel free to ping me by email before the meeting.
Re-reading some of the above comments, I think I now know where the confusion comes from:
We are already there! There's two separate dimensions here:
Point 1 was addressed by #620 after https://github.com/tc39/notes/blob/master/meetings/2016-07/jul-27.md#10ia-require-unicode-900 (and I hope you're not trying to relitigate that). It sends a very important signal to implementers that they should decouple the implementation from the Unicode data they're using, i.e. have a mechanism in place to quickly update their implementation to the latest Unicode data. Point 2 is addressed by these annual PRs that add new properties introduced by new Unicode versions. If, hypothetically, such a PR were rejected, then all the pre-existing properties in ECMAScript's list would continue to be supported, and they would still start matching new characters if the new Unicode version changes their definition, which is exactly the intention.
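The two dimensions can be observed from script code. The following is a minimal sketch, assuming an engine that supports Unicode property escapes; the helper name `supportsPropertyEscape` is hypothetical, and whether the script names added in Unicode 13 are recognized depends on the engine version.

```javascript
// Hypothetical helper: reports whether the engine's property list
// recognizes a given \p{...} expression. Under the `u` flag, an
// unrecognized name or value is a SyntaxError.
function supportsPropertyEscape(expression) {
  try {
    new RegExp(`\\p{${expression}}`, "u");
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// The property list: script names added in Unicode 13. The results
// depend on whether the engine has updated its list.
const unicode13Scripts = ["Chorasmian", "Dives_Akuru", "Khitan_Small_Script", "Yezidi"];
const recognized = unicode13Scripts.map(
  (name) => [name, supportsPropertyEscape(`Script=${name}`)]
);

// The data: a long-standing property is always recognized, and the
// set of code points it matches follows the engine's Unicode data.
const latin = /\p{Script=Latin}/u;
```

An engine with an older list rejects the new names outright, while properties already on the list keep compiling and pick up new code points whenever the underlying data is updated.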
@mathiasbynens You've not helped me understand why these concepts are separate. Are you saying that the set of supported properties in regular expressions is not derived from a Unicode data set directly? Are there any properties that are defined by Unicode that we choose not to pull into ECMA-262 regular expressions?
Exactly. There is a large list of properties that Unicode defines, including so-called Binary properties, Enumerated properties, Catalog properties, and so on. Some of those categories of properties we cannot yet support in ECMAScript (e.g. https://github.com/tc39/proposal-regexp-unicode-sequence-properties), others were deemed not useful enough to support in ECMAScript (e.g. PRs like this one based on a new Unicode release can cover a few types of changes without going through the proposal process:
There's plenty. Blindly pulling in everything would bloat JS engine binary size for little to no gain.
Could those bullet points not be explicitly delegated to Unicode, rather than explicitly enumerated?
I suppose we could make the spec less explicit in this way, but it seems strictly worse than listing the properties/values and their aliases explicitly. Unicode doesn't list them in any one place, and since Unicode docs assume loose matching (which we deliberately decided against for ECMAScript), there's no clear overview of what is a canonical property/value name/alias vs. what isn't. This is a massive interoperability footgun. It's also not what the committee agreed to previously. I don't understand the desire to relitigate this.
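To illustrate the difference between Unicode's loose matching and ECMAScript's exact matching, here is a small sketch. The helper name `compiles` is hypothetical; the behavior shown assumes an engine that implements property escapes as specified.

```javascript
// Returns true if `source` compiles as a Unicode-mode pattern.
function compiles(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Names and aliases spelled exactly as listed in the spec.
const canonicalName = compiles("\\p{Script=Latin}");
const listedAliases = compiles("\\p{sc=Latn}");

// Spellings that loose matching would accept but ECMAScript does not.
const looseCase = compiles("\\p{script=latin}");
const looseSeparators = compiles("\\p{Lowercase Letter}");
```

In a conforming engine the first two patterns compile and the last two are rejected, which is why an explicit list of canonical names and aliases matters for interoperability.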
cc @littledan
It wasn’t clear to any of the editors that this is what we were agreeing to; it’s possible others on the committee were unclear as well. We all like the idea, I expect, of implicitly matching the latest Unicode at all times, but I suspect many of us would not like the idea of the current half-measure.
What half-measure?
@mathiasbynens The time between a Unicode release and us updating these tables, where certain Unicode properties or Script names/aliases are unavailable, inconsistent with the observed properties of code points from other contexts. For example, I can observe the ID_Start property of a code point by using it as an identifier (using |
@michaelficarra This time delta is always going to exist regardless of how quickly we update the spec, since implementing and shipping things takes time. (Also see Can I Unicode.) It boils down to data vs. implementation. In practice, updating the version of the Unicode data an engine is using is a separate task from updating the lists of properties/values and their aliases RegExps should recognize. In V8 for example, the former is done by updating ICU, whereas the latter is done by updating the hardcoded list of properties in the regular expression engine. The current ECMAScript spec provides the complete list of supported properties, values, and aliases. Implementers need these lists. Making this information harder to find is an interoperability footgun.
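The data-versus-list split might look roughly like the following. This is an illustrative sketch only, not V8's actual code: the table `SCRIPT_ALIASES` and the functions `resolveScript` and `scriptMatcher` are hypothetical, and the table holds just three entries. The name list is hardcoded, while the question of which code points belong to a script is left to the Unicode data (here the host engine's own `\p{...}` support stands in for a library such as ICU).

```javascript
// Hypothetical hardcoded list: canonical Script value -> listed alias.
const SCRIPT_ALIASES = new Map([
  ["Latin", "Latn"],
  ["Greek", "Grek"],
  ["Cyrillic", "Cyrl"],
]);

// Resolve a name exactly as written (no loose matching) to its
// canonical value, or return undefined if it is not in the list.
function resolveScript(name) {
  for (const [canonical, alias] of SCRIPT_ALIASES) {
    if (name === canonical || name === alias) return canonical;
  }
  return undefined;
}

// The list decides whether a name is accepted; the Unicode data
// (delegated to the host engine here) decides what it matches.
function scriptMatcher(name) {
  const canonical = resolveScript(name);
  if (canonical === undefined) {
    throw new SyntaxError(`Unsupported script: ${name}`);
  }
  return new RegExp(`^\\p{Script=${canonical}}$`, "u");
}
```

Under this arrangement, a data update changes what `scriptMatcher("Latn")` matches without touching the table, while supporting a new script requires a new table entry.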
@mathiasbynens If we feel it is important for implementers, we can continue to maintain the lists as non-normative. But the normative text should reference the Unicode data sets and describe a way to derive the supported Scripts and other properties. That way we never normatively specify mixed Unicode support.
It's important that the lists continue to be included in the spec. Precisely describing a way to derive the supported properties seems tricky and error-prone. I wrote a rough summary in an earlier comment, but there are exceptions — see tc39/proposal-regexp-unicode-property-escapes#27 for some binary properties that were explicitly excluded. (There's other things that contribute to the trickiness, e.g. ECMAScript should control what properties ECMAScript supports. Unicode should control the data. |
To be clear, I like the idea of describing the way to derive the supported properties, but given that a) it seems difficult to get 100% right, and b) we probably want to retain our freedom to exclude new properties when needed, I'd suggest doing it non-normatively.
Well, I think that's not for us to decide on our own, but a matter to be brought to the committee.
@mathiasbynens Something I failed to bring up during the plenary discussion is that, if we opposed automatically pulling in new property names/aliases, these kinds of PRs are not strictly editorial and must reach consensus. I think we need to back out this and #1939 until we get that consensus.
@michaelficarra The commit message for this PR is marked "Normative", not "Editorial". Note that these changes are already shipping in V8, so I'd advise against backing this out at this point. I'd be happy to let you discuss these PRs in plenary before landing in the future.
@mathiasbynens The result of the editor call is that we'd like these kinds of PRs to get committee consensus before merging. Since the Unicode 13 PRs have already landed, we will leave them in and ask for retroactive consensus at the next meeting.
Closing the loop: the notes for the TC39 discussion mentioned in this thread can be found here: https://github.com/tc39/notes/blob/master/meetings/2020-06/june-2.md#introducing-unicode-support
https://unicode.org/versions/Unicode13.0.0/
Tests: tc39/test262#2526
Ref. #1897.