Editorial: Improve specification of [RegExp]IdentifierNames #2392

jmdyck · 2021-04-24T03:57:41Z

Currently, the algorithm for StringValue of IdentifierName has the step:

Let _idTextUnescaped_ be the result of
  replacing any occurrences of `\\` |UnicodeEscapeSequence| in _idText_
  with the code point represented by the |UnicodeEscapeSequence|.

(And the algorithm for CapturingGroupName of RegExpIdentifierName has something quite similar.)

In #1552 (comment), @michaelficarra said of this step:

This language is a little less precise than I would prefer. Could we define a syntax-directed operation [to specify "the code point represented by the |UnicodeEscapeSequence|"]

This PR does so.

The first commit refactors the grammar slightly, and the second introduces 4 SDOs. The first makes the second easier, but it also has benefits of its own. I think they're easier to understand as separate commits.

The third commit just moves some pre-existing prose. The connection between the prose and the Early Error rules + SDOs could maybe be improved, but that wasn't my focus.

~~Note that the second commit uses list-concatenation in two spots, so if PR #2382 isn't accepted, I'll have to do something else there.~~

jmdyck · 2021-08-17T01:37:13Z

(force-pushed to rebase to master + resolve merge conflicts from #2411)

michaelficarra · 2021-09-27T17:17:53Z

I think you should combine the IdentifierCodePoint(s) SDOs with their RegExp variants. Then you could also take advantage of the chain rule to eliminate some of the definitions currently in RegExpIdentifierCodePoint.

michaelficarra

LGTM otherwise. This is a big improvement IMO.

jmdyck · 2021-09-27T20:47:25Z

I think you should combine the IdentifierCodePoint(s) SDOs with their RegExp variants.

Yeah, there are some parallels between the two, so it's tempting to look for ways to combine them. But I'm doubtful it would be a net benefit.

E.g., consider [RegExp]IdentifierCodePoints:

the algorithm for IdentifierName :: IdentifierStart refers to IdentifierStart, and
the algorithm for RegExpIdentifierName :: RegExpIdentifierStart refers to RegExpIdentifierStart,

so the two rules can't be combined; the best you can do is put them next to each other.

With [RegExp]IdentifierCodePoint, it's slightly better: a couple of the algorithms are the same, so you could join the <emu-grammar>s (e.g., the ones for IdentifierStart :: IdentifierStartChar and RegExpIdentifierStart :: IdentifierStartChar). However, with the other rules, you'd again just be moving them closer together.

I'm not saying there's no benefit to moving two similar rules close together -- if, in future, you need to change one, there's a fair chance you need to change the other as well, and you're more likely to recognize that if they're right next to each other. On the other hand, in this case, it seems like a low probability they'll need to change. Moreover, bringing them together means that at least one of them is taken from its 'home', which is a tradeoff.

(The above assumes you're not talking about merging productions (changing the grammar). That would allow more merging of SDOs, but it would have problems of its own.)

Then you could also take advantage of the chain rule to eliminate some of the definitions currently in RegExpIdentifierCodePoint.

Hm, I don't follow. Can you give an example?

michaelficarra · 2021-09-27T21:07:43Z

@jmdyck Looks like I got IdentifierStart and IdentifierStartChar mixed up in some of the definitions of (RegExp)IdentifierCodePoint and thought the chain rule would apply. After looking again, I don't see it.

If there's no benefit from the chain rule applying, I agree with your point about moving the relevant definitions of the SDO further from where it belongs. At that point, the minor benefit of reducing or grouping together the mostly-duplicated algorithms isn't enough to warrant the change.

…39#2392) Extract `IdentifierStartChar` from `IdentifierStart` and `RegExpIdentifierStart`. Extract `IdentifierPartChar` from `IdentifierPart` and `RegExpIdentifierPart`. This has 3 benefits: - We eliminate some repetition between the productions for Identifiers and RegExpIdentifiers. - We can simplify 4 Early Error rules involving escape sequences, because the constraint can now be expressed in terms of a single nonterminal, rather than a nonterminal plus some terminals. - We can eliminate the Early Error rule for `RegularExpressionFlags` by instead expressing its constraint in the grammar: in the production for `RegularExpressionFlags`, replace `IdentifierPart` with `IdentifierPartChar`. (As a consequence of the last point, this commit undefines the following id: sec-literals-regular-expression-literals-static-semantics-early-errors There didn't seem to be a sensible place to relocate it as an oldid.)

This commit introduces SDOs `IdentifierCodePoints` and `IdentifierCodePoint`. - This allows `StringValue` of _IdentifierName_ to be specified more precisely. - It also simplifies two Early Error rules (involving _UnicodeEscapeSequence_), since they can now be expressed as constraints on a code point, rather than having to be translated into the space of String values. ---- Similarly, this commit introduces SDOs `RegExpIdentifierCodePoints` and `RegExpIdentifierCodePoint`. - This allows `CapturingGroupName` of _RegExpIdentifierName_ to be specified more precisely. - It also simplifies two Early Error rules (involving surrogate pairs). (Note that the current algorithm for `CapturingGroupName` only 'normalizes' escape sequences, whereas this PR's algorithm also normalizes surrogate pairs. However, since the normalized text is immediately passed to `CodePointsToString`, the result should be the same. Given the Early Error rules for surrogate pairs, normalizing them made sense to me.)

... from 12.6 Names and Keywords down to 12.6.1 Identifier Names. I think this makes it clearer that this prose is mostly saying the same thing as the associated Early Error rules and SDOs.

Each of the two tests: early/1447683fba196181.js early/d52f769ab39372c7.js consists of a RegularExpressionLiteral in which the RegularExpressionFlags contains a UnicodeEscapeSequence. Formerly, this was disallowed via an Early Error, but with the merge of tc39/ecma262#2392, it is now disallowed via the grammar. Thus, these tests should move from 'early' to 'fail'.

…39#2392) Extract `IdentifierStartChar` from `IdentifierStart` and `RegExpIdentifierStart`. Extract `IdentifierPartChar` from `IdentifierPart` and `RegExpIdentifierPart`. This has 3 benefits: - We eliminate some repetition between the productions for Identifiers and RegExpIdentifiers. - We can simplify 4 Early Error rules involving escape sequences, because the constraint can now be expressed in terms of a single nonterminal, rather than a nonterminal plus some terminals. - We can eliminate the Early Error rule for `RegularExpressionFlags` by instead expressing its constraint in the grammar: in the production for `RegularExpressionFlags`, replace `IdentifierPart` with `IdentifierPartChar`. (As a consequence of the last point, this commit undefines the following id: sec-literals-regular-expression-literals-static-semantics-early-errors There didn't seem to be a sensible place to relocate it as an oldid.)

This commit introduces SDOs `IdentifierCodePoints` and `IdentifierCodePoint`. - This allows `StringValue` of _IdentifierName_ to be specified more precisely. - It also simplifies two Early Error rules (involving _UnicodeEscapeSequence_), since they can now be expressed as constraints on a code point, rather than having to be translated into the space of String values. ---- Similarly, this commit introduces SDOs `RegExpIdentifierCodePoints` and `RegExpIdentifierCodePoint`. - This allows `CapturingGroupName` of _RegExpIdentifierName_ to be specified more precisely. - It also simplifies two Early Error rules (involving surrogate pairs). (Note that the current algorithm for `CapturingGroupName` only 'normalizes' escape sequences, whereas this PR's algorithm also normalizes surrogate pairs. However, since the normalized text is immediately passed to `CodePointsToString`, the result should be the same. Given the Early Error rules for surrogate pairs, normalizing them made sense to me.)

... from 12.6 Names and Keywords down to 12.6.1 Identifier Names. I think this makes it clearer that this prose is mostly saying the same thing as the associated Early Error rules and SDOs.

jmdyck added the editorial change label May 16, 2021

ljharb force-pushed the master branch 3 times, most recently from 3d0c24c to 7a79833 Compare June 29, 2021 02:21

jmdyck force-pushed the IdentifierCodePoints branch from 458dd79 to b9fcd20 Compare August 17, 2021 01:36

jmdyck force-pushed the IdentifierCodePoints branch from b9fcd20 to 0c5694a Compare September 24, 2021 20:47

michaelficarra approved these changes Sep 27, 2021

View reviewed changes

michaelficarra requested a review from a team September 27, 2021 17:19

bakkot approved these changes Sep 28, 2021

View reviewed changes

bakkot added the ready to merge Editors believe this PR needs no further reviews, and is ready to land. label Sep 28, 2021

jmdyck added 3 commits September 27, 2021 18:34

Editorial: Move 2 paragraphs down one level (tc39#2392)

1901514

... from 12.6 Names and Keywords down to 12.6.1 Identifier Names. I think this makes it clearer that this prose is mostly saying the same thing as the associated Early Error rules and SDOs.

ljharb force-pushed the IdentifierCodePoints branch from 0c5694a to 1901514 Compare September 28, 2021 01:35

ljharb merged commit 1901514 into tc39:master Sep 28, 2021

jmdyck mentioned this pull request Sep 28, 2021

Move two files from 'early' to 'fail' after merge of PR #2392 tc39/test262-parser-tests#33

Open

jmdyck deleted the IdentifierCodePoints branch September 29, 2021 02:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Editorial: Improve specification of [RegExp]IdentifierNames #2392

Editorial: Improve specification of [RegExp]IdentifierNames #2392

jmdyck commented Apr 24, 2021 •

edited

Loading

jmdyck commented Aug 17, 2021

michaelficarra commented Sep 27, 2021

michaelficarra left a comment

jmdyck commented Sep 27, 2021

michaelficarra commented Sep 27, 2021

Editorial: Improve specification of [RegExp]IdentifierNames #2392

Editorial: Improve specification of [RegExp]IdentifierNames #2392

Conversation

jmdyck commented Apr 24, 2021 • edited Loading

jmdyck commented Aug 17, 2021

michaelficarra commented Sep 27, 2021

michaelficarra left a comment

Choose a reason for hiding this comment

jmdyck commented Sep 27, 2021

michaelficarra commented Sep 27, 2021

jmdyck commented Apr 24, 2021 •

edited

Loading