Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Editorial: Have SV, TV, TRV, and CodePointToUTF16CodeUnits return a String rather than a sequence of code units #2018

Merged
merged 7 commits into from
Nov 29, 2020

Conversation

jmdyck
Copy link
Collaborator

@jmdyck jmdyck commented May 29, 2020

(See issue #828.)

The first 4 commits are independent refactorings that would be useful even if the main point of the PR is rejected. They're here because they reduce the number of cases that the final commit has to deal with.

@jmdyck jmdyck changed the title Have SV, TV, TRV, and UTF16Encoding return a String rather than a sequence of code units Editorial: Have SV, TV, TRV, and UTF16Encoding return a String rather than a sequence of code units May 29, 2020
@jmdyck jmdyck force-pushed the sequence-of-code-units branch from 0a7ae74 to b40f0ee Compare May 29, 2020 13:59
@jmdyck jmdyck marked this pull request as draft May 29, 2020 15:43
@jmdyck jmdyck force-pushed the sequence-of-code-units branch from b40f0ee to 469459e Compare May 29, 2020 17:18
@jmdyck jmdyck marked this pull request as ready for review May 29, 2020 17:21
spec.html Outdated
1. Let _cu1_ be floor((_cp_ - 0x10000) / 0x400) + 0xD800.
1. Let _cu2_ be ((_cp_ - 0x10000) modulo 0x400) + 0xDC00.
1. Return the code unit sequence consisting of _cu1_ followed by _cu2_.
1. Return the string-concatenation of _cu1_ and _cu2_.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cu1 and cu2 are code units, and I'm not sure we should directly "concatenate" them. We should instead say

Return the String value consisting of cu1 followed by cu2.

or something.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definition of string-concatenation specifically allows that an argument can be a code unit, so the string-concatenation of two code units should be fine. (In fact, the String consisting of the code unit _first_ followed by the code unit _second_ was one of the specific forms that was replaced with an invocation of string-concatenation when it was introduced.)

What has bugged me about this operation is that _cu1_ and _cu2_ are set to the results of arithmetic formulae, which makes them either Number values or mathematical values (depending on how PR #2007 works out), and usually the spec draws a distinction between such values and code units. (E.g., using phrasing such as "the code unit whose value is [numeric value]" or "the numeric value of [code unit]".) So I'd be inclined to say, e.g.

1. Let _cu1_ be the code unit whose value is floor(...) + 0xD800.

That's actually independent of this PR, but we got close enough to it that I thought it was worth mentioning.

Copy link
Member

@michaelficarra michaelficarra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise.

@jmdyck jmdyck force-pushed the sequence-of-code-units branch from 469459e to c9b32a5 Compare September 18, 2020 02:47
@jmdyck
Copy link
Collaborator Author

jmdyck commented Sep 18, 2020

force-pushed to resolve merge conflicts.
Also, I added a couple more minor commits.

@jmdyck jmdyck changed the title Editorial: Have SV, TV, TRV, and UTF16Encoding return a String rather than a sequence of code units Editorial: Have SV, TV, TRV, and CodePointToUTF16CodeUnits return a String rather than a sequence of code units Sep 18, 2020
@ljharb ljharb requested review from syg, bakkot and a team September 18, 2020 05:29
@bakkot bakkot added the editor call to be discussed in the next editor call label Oct 7, 2020
@michaelficarra michaelficarra removed the editor call to be discussed in the next editor call label Oct 14, 2020
@bakkot
Copy link
Contributor

bakkot commented Oct 14, 2020

Editors talked about it; we're on board with this direction but need to review it.

spec.html Outdated
<h1>Static Semantics: CodePointToUTF16CodeUnits ( _cp_ )</h1>
<p>The abstract operation CodePointToUTF16CodeUnits takes argument _cp_ (a numeric code point value). It performs the following steps when called:</p>
<emu-clause id="sec-codepointtostring" aoid="CodePointToString" oldids="sec-utf16encoding,sec-codepointtoutf16codeunits">
<h1>Static Semantics: CodePointToString ( _cp_ )</h1>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually still kinda prefer calling this AO CodePointToUTF16CodeUnits since it's really clear why it's needed. @tc39/ecma262-editors?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that makes it more clear - the direct analog here is String.fromCodePoint.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I think about fromCodePoint, I am thinking about the fact that I put in a Number and get out a String. Here I want readers to think about the fact that they put in a code point and get out a sequence of code units of length potentially greater than 1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"a sequence of code units of length potentially greater than 1" is also called "a string".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CharCodeToString would of course return a string of length 1, because a charcode represents a single code unit. A code point represents 1 or 2 code points, so it would return a string of length 1 or 2. I don't see why it needs emphasizing, or why "CodePointToUTF16CodeUnits" would make it clear that it can return a string of length < 2 (to me, "units" is plural, and thus implies "more than one").

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The preamble could say that the operation returns a String consisting of 1 or 2 code units. Does that help?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmdyck I don't think that really addresses it. I'm thinking about the reader understanding what the function does at each call site. I think using the term "string" within the preamble or the steps is fine since you have the explicit description of its behaviour right there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we agree that unifying on Strings in favor of sequences of UTF16 code units is a good thing, and we did on the last editor call, then this AO should avoid UTF16CodeUnits in the name as that would be its own form of confusion.

The discussion seems to be stuck on whether the plural "Code Units" better primes the readers at callsites to think about the length of the result, which can be 1 or 2. I don't find the CodePointToUTF16CodeUnits name to be particularly clear on conveying that point.

There's also an argument that CodePointToUTF16CodeUnits is clearer on what the AO does than ToString. If that's the position, then it feels like you actually don't think unifying on Strings in favor of sequences of UTF16 code units is a good idea to begin with.

Maybe something like UTF16EncodeCodePoint instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like UTF16EncodeCodePoint as much as CodePointToUTF16CodeUnits, but I'd be OK with it; it's better than the other suggestions so far.

@jmdyck jmdyck force-pushed the sequence-of-code-units branch from c9b32a5 to 475cd0b Compare October 15, 2020 16:05
@jmdyck
Copy link
Collaborator Author

jmdyck commented Oct 15, 2020

(force-pushed to resolve merge conflicts)

@jmdyck jmdyck force-pushed the sequence-of-code-units branch from 475cd0b to 4c2d292 Compare November 29, 2020 05:10
@jmdyck
Copy link
Collaborator Author

jmdyck commented Nov 29, 2020

force-pushed to:
(a) rebase to master, and
(b) replace "CodePointToString" with "UTF16EncodeCodePoint" (i.e., the PR now renames CodePointToUTF16CodeUnits to UTF16EncodeCodePoint)

@ljharb ljharb self-assigned this Nov 29, 2020
@ljharb ljharb requested a review from a team November 29, 2020 06:23
(i.e., cases that are handled by the rule for chain productions)
This simplifies the algorithm for CharacterValue of LegacyOctalEscapeSequence,
avoiding the detour through SV.
This simplifies the algorithm for CharacterValue of HexEscapeSequence.

(Also, move the definition of MV of Hex4Digits here.)
In general, SV returns a sequence of code units.
So, on the face of it, it's odd to require that
the result of SV fall within a range of integers.

Instead, it's simpler to phrase the requirement
in terms of the MV of |Hex4Digits|,
avoiding the detour though SV.
…tring (tc39#2018)

... rather than a sequence of code units.
(See issue tc39#828.)
@ljharb ljharb force-pushed the sequence-of-code-units branch from 4c2d292 to d5948f7 Compare November 29, 2020 07:18
@ljharb ljharb merged commit d5948f7 into tc39:master Nov 29, 2020
@jmdyck jmdyck deleted the sequence-of-code-units branch November 29, 2020 15:17
jmdyck added a commit to jmdyck/ecma262 that referenced this pull request Nov 30, 2020
Where the SV of a production only depends on
the MV of the production's LHS,
change it to use the 'parent' production,
so that SV doesn't reach deeper than it needs to.

(This commit would have been more at home in PR tc39#2018.)
jmdyck added a commit to jmdyck/ecma262 that referenced this pull request Jan 16, 2021
Where the SV of a production only depends on
the MV of the production's LHS,
change it to use the 'parent' production,
so that SV doesn't reach deeper than it needs to.

(This commit would have been more at home in PR tc39#2018.)
jmdyck added a commit to jmdyck/ecma262 that referenced this pull request Jan 16, 2021
Where the SV of a production only depends on
the MV of the production's LHS,
change it to use the 'parent' production,
so that SV doesn't reach deeper than it needs to.

(This commit would have been more at home in PR tc39#2018.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants