Editorial: Have SV, TV, TRV, and CodePointToUTF16CodeUnits return a String rather than a sequence of code units #2018

jmdyck · 2020-05-29T02:13:37Z

(See issue #828.)

The first 4 commits are independent refactorings that would be useful even if the main point of the PR is rejected. They're here because they reduce the number of cases that the final commit has to deal with.

michaelficarra · 2020-06-02T00:06:56Z

spec.html

        1. Let _cu1_ be floor((_cp_ - 0x10000) / 0x400) + 0xD800.
        1. Let _cu2_ be ((_cp_ - 0x10000) modulo 0x400) + 0xDC00.
-        1. Return the code unit sequence consisting of _cu1_ followed by _cu2_.
+        1. Return the string-concatenation of _cu1_ and _cu2_.


cu1 and cu2 are code units, and I'm not sure we should directly "concatenate" them. We should instead say

Return the String value consisting of cu1 followed by cu2.

or something.

The definition of string-concatenation specifically allows that an argument can be a code unit, so the string-concatenation of two code units should be fine. (In fact, the String consisting of the code unit _first_ followed by the code unit _second_ was one of the specific forms that was replaced with an invocation of string-concatenation when it was introduced.)

What has bugged me about this operation is that _cu1_ and _cu2_ are set to the results of arithmetic formulae, which makes them either Number values or mathematical values (depending on how PR #2007 works out), and usually the spec draws a distinction between such values and code units. (E.g., using phrasing such as "the code unit whose value is [numeric value]" or "the numeric value of [code unit]".) So I'd be inclined to say, e.g.

1. Let _cu1_ be the code unit whose value is floor(...) + 0xD800.

That's actually independent of this PR, but we got close enough to it that I thought it was worth mentioning.

michaelficarra

LGTM otherwise.

jmdyck · 2020-09-18T02:48:34Z

force-pushed to resolve merge conflicts.
Also, I added a couple more minor commits.

bakkot · 2020-10-14T22:11:13Z

Editors talked about it; we're on board with this direction but need to review it.

michaelficarra · 2020-10-15T01:30:56Z

spec.html

-      <h1>Static Semantics: CodePointToUTF16CodeUnits ( _cp_ )</h1>
-      <p>The abstract operation CodePointToUTF16CodeUnits takes argument _cp_ (a numeric code point value). It performs the following steps when called:</p>
+    <emu-clause id="sec-codepointtostring" aoid="CodePointToString" oldids="sec-utf16encoding,sec-codepointtoutf16codeunits">
+      <h1>Static Semantics: CodePointToString ( _cp_ )</h1>


I actually still kinda prefer calling this AO CodePointToUTF16CodeUnits since it's really clear why it's needed. @tc39/ecma262-editors?

I'm not sure that makes it more clear - the direct analog here is String.fromCodePoint.

When I think about fromCodePoint, I am thinking about the fact that I put in a Number and get out a String. Here I want readers to think about the fact that they put in a code point and get out a sequence of code units of length potentially greater than 1.

"a sequence of code units of length potentially greater than 1" is also called "a string".

CharCodeToString would of course return a string of length 1, because a charcode represents a single code unit. A code point represents 1 or 2 code points, so it would return a string of length 1 or 2. I don't see why it needs emphasizing, or why "CodePointToUTF16CodeUnits" would make it clear that it can return a string of length < 2 (to me, "units" is plural, and thus implies "more than one").

The preamble could say that the operation returns a String consisting of 1 or 2 code units. Does that help?

@jmdyck I don't think that really addresses it. I'm thinking about the reader understanding what the function does at each call site. I think using the term "string" within the preamble or the steps is fine since you have the explicit description of its behaviour right there.

If we agree that unifying on Strings in favor of sequences of UTF16 code units is a good thing, and we did on the last editor call, then this AO should avoid UTF16CodeUnits in the name as that would be its own form of confusion.

The discussion seems to be stuck on whether the plural "Code Units" better primes the readers at callsites to think about the length of the result, which can be 1 or 2. I don't find the CodePointToUTF16CodeUnits name to be particularly clear on conveying that point.

There's also an argument that CodePointToUTF16CodeUnits is clearer on what the AO does than ToString. If that's the position, then it feels like you actually don't think unifying on Strings in favor of sequences of UTF16 code units is a good idea to begin with.

Maybe something like UTF16EncodeCodePoint instead?

I don't like UTF16EncodeCodePoint as much as CodePointToUTF16CodeUnits, but I'd be OK with it; it's better than the other suggestions so far.

jmdyck · 2020-10-15T16:05:54Z

(force-pushed to resolve merge conflicts)

jmdyck · 2020-11-29T05:12:58Z

force-pushed to:
(a) rebase to master, and
(b) replace "CodePointToString" with "UTF16EncodeCodePoint" (i.e., the PR now renames CodePointToUTF16CodeUnits to UTF16EncodeCodePoint)

(i.e., cases that are handled by the rule for chain productions)

This simplifies the algorithm for CharacterValue of LegacyOctalEscapeSequence, avoiding the detour through SV.

This simplifies the algorithm for CharacterValue of HexEscapeSequence. (Also, move the definition of MV of Hex4Digits here.)

In general, SV returns a sequence of code units. So, on the face of it, it's odd to require that the result of SV fall within a range of integers. Instead, it's simpler to phrase the requirement in terms of the MV of |Hex4Digits|, avoiding the detour though SV.

…tring (tc39#2018) ... rather than a sequence of code units. (See issue tc39#828.)

…c39#2018)

…tToString (tc39#2018)

Where the SV of a production only depends on the MV of the production's LHS, change it to use the 'parent' production, so that SV doesn't reach deeper than it needs to. (This commit would have been more at home in PR tc39#2018.)

jmdyck changed the title ~~Have SV, TV, TRV, and UTF16Encoding return a String rather than a sequence of code units~~ Editorial: Have SV, TV, TRV, and UTF16Encoding return a String rather than a sequence of code units May 29, 2020

jmdyck added the editorial change label May 29, 2020

jmdyck force-pushed the sequence-of-code-units branch from 0a7ae74 to b40f0ee Compare May 29, 2020 13:59

jmdyck marked this pull request as draft May 29, 2020 15:43

jmdyck force-pushed the sequence-of-code-units branch from b40f0ee to 469459e Compare May 29, 2020 17:18

jmdyck marked this pull request as ready for review May 29, 2020 17:21

michaelficarra reviewed Jun 2, 2020

View reviewed changes

michaelficarra approved these changes Jun 2, 2020

View reviewed changes

jmdyck force-pushed the sequence-of-code-units branch from 469459e to c9b32a5 Compare September 18, 2020 02:47

jmdyck changed the title ~~Editorial: Have SV, TV, TRV, and UTF16Encoding return a String rather than a sequence of code units~~ Editorial: Have SV, TV, TRV, and CodePointToUTF16CodeUnits return a String rather than a sequence of code units Sep 18, 2020

ljharb requested review from syg, bakkot and a team September 18, 2020 05:29

bakkot added the editor call to be discussed in the next editor call label Oct 7, 2020

michaelficarra removed the editor call to be discussed in the next editor call label Oct 14, 2020

michaelficarra reviewed Oct 15, 2020

View reviewed changes

michaelficarra approved these changes Oct 15, 2020

View reviewed changes

jmdyck force-pushed the sequence-of-code-units branch from c9b32a5 to 475cd0b Compare October 15, 2020 16:05

jmdyck force-pushed the sequence-of-code-units branch from 475cd0b to 4c2d292 Compare November 29, 2020 05:10

ljharb approved these changes Nov 29, 2020

View reviewed changes

ljharb self-assigned this Nov 29, 2020

ljharb requested a review from a team November 29, 2020 06:23

bakkot approved these changes Nov 29, 2020

View reviewed changes

jmdyck added 3 commits November 28, 2020 23:18

Editorial: Delete 2 chain rules for SV (tc39#2018)

0b01bda

(i.e., cases that are handled by the rule for chain productions)

Editorial: Introduce MV of LegacyOctalEscapeSequence (tc39#2018)

1a22134

This simplifies the algorithm for CharacterValue of LegacyOctalEscapeSequence, avoiding the detour through SV.

Editorial: Introduce MV of HexEscapeSequence (tc39#2018)

c8eaf65

This simplifies the algorithm for CharacterValue of HexEscapeSequence. (Also, move the definition of MV of Hex4Digits here.)

jmdyck added 4 commits November 28, 2020 23:18

Editorial: Have SV, TV, TRV, and CodePointToUTF16CodeUnits return a S…

e28209f

…tring (tc39#2018) ... rather than a sequence of code units. (See issue tc39#828.)

Editorial: Rename CodePointToUTF16CodeUnits to UTF16EncodeCodePoint (t…

5c8edf7

…c39#2018)

Editorial: Explictly 'convert' numeric value to code unit in CodePoin…

d5948f7

…tToString (tc39#2018)

ljharb force-pushed the sequence-of-code-units branch from 4c2d292 to d5948f7 Compare November 29, 2020 07:18

ljharb merged commit d5948f7 into tc39:master Nov 29, 2020

jmdyck deleted the sequence-of-code-units branch November 29, 2020 15:17

jmdyck mentioned this pull request Feb 9, 2021

Editorial: "sequence of code units" != "String" ? #828

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Editorial: Have SV, TV, TRV, and CodePointToUTF16CodeUnits return a String rather than a sequence of code units #2018

Editorial: Have SV, TV, TRV, and CodePointToUTF16CodeUnits return a String rather than a sequence of code units #2018

jmdyck commented May 29, 2020

michaelficarra Jun 2, 2020

jmdyck Jun 2, 2020

michaelficarra left a comment

jmdyck commented Sep 18, 2020

bakkot commented Oct 14, 2020

michaelficarra Oct 15, 2020

bakkot Oct 15, 2020

ljharb Oct 15, 2020

bakkot Oct 15, 2020

ljharb Oct 15, 2020

ljharb Oct 15, 2020

jmdyck Oct 15, 2020

michaelficarra Oct 15, 2020

syg Oct 15, 2020

bakkot Oct 15, 2020

jmdyck commented Oct 15, 2020

jmdyck commented Nov 29, 2020

Editorial: Have SV, TV, TRV, and CodePointToUTF16CodeUnits return a String rather than a sequence of code units #2018

Editorial: Have SV, TV, TRV, and CodePointToUTF16CodeUnits return a String rather than a sequence of code units #2018

Conversation

jmdyck commented May 29, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

michaelficarra left a comment

Choose a reason for hiding this comment

jmdyck commented Sep 18, 2020

bakkot commented Oct 14, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jmdyck commented Oct 15, 2020

jmdyck commented Nov 29, 2020