Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Editorial: Improve specification of [RegExp]IdentifierNames #2392

Merged
merged 3 commits into from
Sep 28, 2021

Conversation

jmdyck
Copy link
Collaborator

@jmdyck jmdyck commented Apr 24, 2021

Currently, the algorithm for StringValue of IdentifierName has the step:

Let _idTextUnescaped_ be the result of
  replacing any occurrences of `\\` |UnicodeEscapeSequence| in _idText_
  with the code point represented by the |UnicodeEscapeSequence|.

(And the algorithm for CapturingGroupName of RegExpIdentifierName has something quite similar.)

In #1552 (comment), @michaelficarra said of this step:

This language is a little less precise than I would prefer. Could we define a syntax-directed operation [to specify "the code point represented by the |UnicodeEscapeSequence|"]

This PR does so.

The first commit refactors the grammar slightly, and the second introduces 4 SDOs. The first makes the second easier, but it also has benefits of its own. I think they're easier to understand as separate commits.

The third commit just moves some pre-existing prose. The connection between the prose and the Early Error rules + SDOs could maybe be improved, but that wasn't my focus.

Note that the second commit uses list-concatenation in two spots, so if PR #2382 isn't accepted, I'll have to do something else there.

@ljharb ljharb force-pushed the master branch 3 times, most recently from 3d0c24c to 7a79833 Compare June 29, 2021 02:21
@jmdyck jmdyck force-pushed the IdentifierCodePoints branch from 458dd79 to b9fcd20 Compare August 17, 2021 01:36
@jmdyck
Copy link
Collaborator Author

jmdyck commented Aug 17, 2021

(force-pushed to rebase to master + resolve merge conflicts from #2411)

@jmdyck jmdyck force-pushed the IdentifierCodePoints branch from b9fcd20 to 0c5694a Compare September 24, 2021 20:47
@michaelficarra
Copy link
Member

I think you should combine the IdentifierCodePoint(s) SDOs with their RegExp variants. Then you could also take advantage of the chain rule to eliminate some of the definitions currently in RegExpIdentifierCodePoint.

Copy link
Member

@michaelficarra michaelficarra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise. This is a big improvement IMO.

@michaelficarra michaelficarra requested a review from a team September 27, 2021 17:19
@jmdyck
Copy link
Collaborator Author

jmdyck commented Sep 27, 2021

I think you should combine the IdentifierCodePoint(s) SDOs with their RegExp variants.

Yeah, there are some parallels between the two, so it's tempting to look for ways to combine them. But I'm doubtful it would be a net benefit.

E.g., consider [RegExp]IdentifierCodePoints:

  • the algorithm for IdentifierName :: IdentifierStart refers to IdentifierStart, and
  • the algorithm for RegExpIdentifierName :: RegExpIdentifierStart refers to RegExpIdentifierStart,

so the two rules can't be combined; the best you can do is put them next to each other.

With [RegExp]IdentifierCodePoint, it's slightly better: a couple of the algorithms are the same, so you could join the <emu-grammar>s (e.g., the ones for IdentifierStart :: IdentifierStartChar and RegExpIdentifierStart :: IdentifierStartChar). However, with the other rules, you'd again just be moving them closer together.

I'm not saying there's no benefit to moving two similar rules close together -- if, in future, you need to change one, there's a fair chance you need to change the other as well, and you're more likely to recognize that if they're right next to each other. On the other hand, in this case, it seems like a low probability they'll need to change. Moreover, bringing them together means that at least one of them is taken from its 'home', which is a tradeoff.

(The above assumes you're not talking about merging productions (changing the grammar). That would allow more merging of SDOs, but it would have problems of its own.)

Then you could also take advantage of the chain rule to eliminate some of the definitions currently in RegExpIdentifierCodePoint.

Hm, I don't follow. Can you give an example?

@michaelficarra
Copy link
Member

@jmdyck Looks like I got IdentifierStart and IdentifierStartChar mixed up in some of the definitions of (RegExp)IdentifierCodePoint and thought the chain rule would apply. After looking again, I don't see it.

If there's no benefit from the chain rule applying, I agree with your point about moving the relevant definitions of the SDO further from where it belongs. At that point, the minor benefit of reducing or grouping together the mostly-duplicated algorithms isn't enough to warrant the change.

@bakkot bakkot added the ready to merge Editors believe this PR needs no further reviews, and is ready to land. label Sep 28, 2021
…39#2392)

Extract `IdentifierStartChar` from `IdentifierStart` and `RegExpIdentifierStart`.
Extract `IdentifierPartChar` from `IdentifierPart` and `RegExpIdentifierPart`.

This has 3 benefits:

- We eliminate some repetition between the productions for
  Identifiers and RegExpIdentifiers.

- We can simplify 4 Early Error rules involving escape sequences,
  because the constraint can now be expressed in terms of a single nonterminal,
  rather than a nonterminal plus some terminals.

- We can eliminate the Early Error rule for `RegularExpressionFlags`
  by instead expressing its constraint in the grammar:
  in the production for `RegularExpressionFlags`,
  replace `IdentifierPart` with `IdentifierPartChar`.

(As a consequence of the last point, this commit undefines the following id:
sec-literals-regular-expression-literals-static-semantics-early-errors
There didn't seem to be a sensible place to relocate it as an oldid.)
This commit introduces SDOs `IdentifierCodePoints` and `IdentifierCodePoint`.

- This allows `StringValue` of _IdentifierName_ to be specified more precisely.

- It also simplifies two Early Error rules (involving _UnicodeEscapeSequence_),
  since they can now be expressed as constraints on a code point, rather than
  having to be translated into the space of String values.

----

Similarly, this commit introduces SDOs `RegExpIdentifierCodePoints` and
`RegExpIdentifierCodePoint`.

- This allows `CapturingGroupName` of _RegExpIdentifierName_ to be specified
  more precisely.

- It also simplifies two Early Error rules (involving surrogate pairs).

(Note that the current algorithm for `CapturingGroupName` only 'normalizes'
escape sequences, whereas this PR's algorithm also normalizes surrogate pairs.
However, since the normalized text is immediately passed to `CodePointsToString`,
the result should be the same. Given the Early Error rules for surrogate pairs,
normalizing them made sense to me.)
... from 12.6 Names and Keywords
down to 12.6.1 Identifier Names.

I think this makes it clearer that this prose
is mostly saying the same thing as the associated
Early Error rules and SDOs.
@ljharb ljharb force-pushed the IdentifierCodePoints branch from 0c5694a to 1901514 Compare September 28, 2021 01:35
@ljharb ljharb merged commit 1901514 into tc39:master Sep 28, 2021
jmdyck added a commit to jmdyck/test262-parser-tests that referenced this pull request Sep 28, 2021
Each of the two tests:
  early/1447683fba196181.js
  early/d52f769ab39372c7.js
consists of a RegularExpressionLiteral in which the RegularExpressionFlags
contains a UnicodeEscapeSequence.

Formerly, this was disallowed via an Early Error,
but with the merge of tc39/ecma262#2392,
it is now disallowed via the grammar.
Thus, these tests should move from 'early' to 'fail'.
@jmdyck jmdyck deleted the IdentifierCodePoints branch September 29, 2021 02:14
mathiasbynens pushed a commit to mathiasbynens/ecma262 that referenced this pull request Oct 18, 2021
…39#2392)

Extract `IdentifierStartChar` from `IdentifierStart` and `RegExpIdentifierStart`.
Extract `IdentifierPartChar` from `IdentifierPart` and `RegExpIdentifierPart`.

This has 3 benefits:

- We eliminate some repetition between the productions for
  Identifiers and RegExpIdentifiers.

- We can simplify 4 Early Error rules involving escape sequences,
  because the constraint can now be expressed in terms of a single nonterminal,
  rather than a nonterminal plus some terminals.

- We can eliminate the Early Error rule for `RegularExpressionFlags`
  by instead expressing its constraint in the grammar:
  in the production for `RegularExpressionFlags`,
  replace `IdentifierPart` with `IdentifierPartChar`.

(As a consequence of the last point, this commit undefines the following id:
sec-literals-regular-expression-literals-static-semantics-early-errors
There didn't seem to be a sensible place to relocate it as an oldid.)
mathiasbynens pushed a commit to mathiasbynens/ecma262 that referenced this pull request Oct 18, 2021
This commit introduces SDOs `IdentifierCodePoints` and `IdentifierCodePoint`.

- This allows `StringValue` of _IdentifierName_ to be specified more precisely.

- It also simplifies two Early Error rules (involving _UnicodeEscapeSequence_),
  since they can now be expressed as constraints on a code point, rather than
  having to be translated into the space of String values.

----

Similarly, this commit introduces SDOs `RegExpIdentifierCodePoints` and
`RegExpIdentifierCodePoint`.

- This allows `CapturingGroupName` of _RegExpIdentifierName_ to be specified
  more precisely.

- It also simplifies two Early Error rules (involving surrogate pairs).

(Note that the current algorithm for `CapturingGroupName` only 'normalizes'
escape sequences, whereas this PR's algorithm also normalizes surrogate pairs.
However, since the normalized text is immediately passed to `CodePointsToString`,
the result should be the same. Given the Early Error rules for surrogate pairs,
normalizing them made sense to me.)
mathiasbynens pushed a commit to mathiasbynens/ecma262 that referenced this pull request Oct 18, 2021
... from 12.6 Names and Keywords
down to 12.6.1 Identifier Names.

I think this makes it clearer that this prose
is mostly saying the same thing as the associated
Early Error rules and SDOs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
editorial change ready to merge Editors believe this PR needs no further reviews, and is ready to land.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants