-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Editorial: Have SV, TV, TRV, and CodePointToUTF16CodeUnits return a String rather than a sequence of code units #2018
Conversation
0a7ae74
to
b40f0ee
Compare
b40f0ee
to
469459e
Compare
spec.html
Outdated
1. Let _cu1_ be floor((_cp_ - 0x10000) / 0x400) + 0xD800. | ||
1. Let _cu2_ be ((_cp_ - 0x10000) modulo 0x400) + 0xDC00. | ||
1. Return the code unit sequence consisting of _cu1_ followed by _cu2_. | ||
1. Return the string-concatenation of _cu1_ and _cu2_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cu1 and cu2 are code units, and I'm not sure we should directly "concatenate" them. We should instead say
Return the String value consisting of cu1 followed by cu2.
or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The definition of string-concatenation
specifically allows that an argument can be a code unit, so the string-concatenation of two code units should be fine. (In fact, the String consisting of the code unit _first_ followed by the code unit _second_
was one of the specific forms that was replaced with an invocation of string-concatenation when it was introduced.)
What has bugged me about this operation is that _cu1_
and _cu2_
are set to the results of arithmetic formulae, which makes them either Number values or mathematical values (depending on how PR #2007 works out), and usually the spec draws a distinction between such values and code units. (E.g., using phrasing such as "the code unit whose value is [numeric value]" or "the numeric value of [code unit]".) So I'd be inclined to say, e.g.
1. Let _cu1_ be the code unit whose value is floor(...) + 0xD800.
That's actually independent of this PR, but we got close enough to it that I thought it was worth mentioning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM otherwise.
469459e
to
c9b32a5
Compare
force-pushed to resolve merge conflicts. |
Editors talked about it; we're on board with this direction but need to review it. |
spec.html
Outdated
<h1>Static Semantics: CodePointToUTF16CodeUnits ( _cp_ )</h1> | ||
<p>The abstract operation CodePointToUTF16CodeUnits takes argument _cp_ (a numeric code point value). It performs the following steps when called:</p> | ||
<emu-clause id="sec-codepointtostring" aoid="CodePointToString" oldids="sec-utf16encoding,sec-codepointtoutf16codeunits"> | ||
<h1>Static Semantics: CodePointToString ( _cp_ )</h1> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually still kinda prefer calling this AO CodePointToUTF16CodeUnits
since it's really clear why it's needed. @tc39/ecma262-editors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that makes it more clear - the direct analog here is String.fromCodePoint
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I think about fromCodePoint
, I am thinking about the fact that I put in a Number and get out a String. Here I want readers to think about the fact that they put in a code point and get out a sequence of code units of length potentially greater than 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"a sequence of code units of length potentially greater than 1" is also called "a string".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CharCodeToString would of course return a string of length 1, because a charcode represents a single code unit. A code point represents 1 or 2 code points, so it would return a string of length 1 or 2. I don't see why it needs emphasizing, or why "CodePointToUTF16CodeUnits" would make it clear that it can return a string of length < 2 (to me, "units" is plural, and thus implies "more than one").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The preamble could say that the operation returns a String consisting of 1 or 2 code units. Does that help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmdyck I don't think that really addresses it. I'm thinking about the reader understanding what the function does at each call site. I think using the term "string" within the preamble or the steps is fine since you have the explicit description of its behaviour right there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we agree that unifying on Strings in favor of sequences of UTF16 code units is a good thing, and we did on the last editor call, then this AO should avoid UTF16CodeUnits
in the name as that would be its own form of confusion.
The discussion seems to be stuck on whether the plural "Code Units" better primes the readers at callsites to think about the length of the result, which can be 1 or 2. I don't find the CodePointToUTF16CodeUnits
name to be particularly clear on conveying that point.
There's also an argument that CodePointToUTF16CodeUnits
is clearer on what the AO does than ToString
. If that's the position, then it feels like you actually don't think unifying on Strings in favor of sequences of UTF16 code units is a good idea to begin with.
Maybe something like UTF16EncodeCodePoint
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like UTF16EncodeCodePoint
as much as CodePointToUTF16CodeUnits
, but I'd be OK with it; it's better than the other suggestions so far.
c9b32a5
to
475cd0b
Compare
(force-pushed to resolve merge conflicts) |
475cd0b
to
4c2d292
Compare
force-pushed to: |
(i.e., cases that are handled by the rule for chain productions)
This simplifies the algorithm for CharacterValue of LegacyOctalEscapeSequence, avoiding the detour through SV.
This simplifies the algorithm for CharacterValue of HexEscapeSequence. (Also, move the definition of MV of Hex4Digits here.)
In general, SV returns a sequence of code units. So, on the face of it, it's odd to require that the result of SV fall within a range of integers. Instead, it's simpler to phrase the requirement in terms of the MV of |Hex4Digits|, avoiding the detour though SV.
4c2d292
to
d5948f7
Compare
Where the SV of a production only depends on the MV of the production's LHS, change it to use the 'parent' production, so that SV doesn't reach deeper than it needs to. (This commit would have been more at home in PR tc39#2018.)
Where the SV of a production only depends on the MV of the production's LHS, change it to use the 'parent' production, so that SV doesn't reach deeper than it needs to. (This commit would have been more at home in PR tc39#2018.)
Where the SV of a production only depends on the MV of the production's LHS, change it to use the 'parent' production, so that SV doesn't reach deeper than it needs to. (This commit would have been more at home in PR tc39#2018.)
(See issue #828.)
The first 4 commits are independent refactorings that would be useful even if the main point of the PR is rejected. They're here because they reduce the number of cases that the final commit has to deal with.