Editorial: Eliminate order-disambiguation from Annex B Pattern-grammar #2445

Open
wants to merge 3 commits into
base: main
Changes from 1 commit
36 changes: 24 additions & 12 deletions spec.html
@@ -830,7 +830,12 @@ <h1>Lookahead Restrictions</h1>
<p>In the above:</p>
<ul>
<li>_seq_ is a sequence of terminal symbols from the production's grammar; and</li>
<li>_set_ is a finite non-empty set of terminal sequences. For convenience, _set_ can also be written as a nonterminal from the production's grammar, in which case it represents the set of all terminal sequences to which that nonterminal could expand. It is considered an editorial error if the nonterminal could expand to infinitely many distinct terminal sequences.</li>
<li>_set_ is either:
<ul>
<li>an explicit non-empty set of terminal sequences. In the syntactic grammar, such a sequence can also include a "[no LineTerminator here]" phrase.</li>
<li>a non-empty sequence of symbols from the production's grammar, including one nonterminal. This sequence represents the set of all terminal sequences to which that sequence could expand. In the syntactic grammar, it is considered an editorial error if the nonterminal could expand to infinitely many distinct terminal sequences. In other grammars, it is considered an editorial error if the nonterminal's expansion is not a regular set (i.e., if it isn't equivalent to a regular expression over code points).</li>
</ul>
</li>
</ul>
<p>As an example, given the definitions:</p>
<emu-grammar type="definition" example>
@@ -50315,7 +50320,7 @@ <h2>Syntax</h2>

<emu-annex id="sec-regular-expressions-patterns">
<h1>Regular Expressions Patterns</h1>
<p>The syntax of <emu-xref href="#sec-patterns"></emu-xref> is modified and extended as follows. These changes introduce ambiguities that are broken by the ordering of grammar productions and by contextual information. When parsing using the following grammar, each alternative is considered only if previous production alternatives do not match.</p>
<p>The syntax of <emu-xref href="#sec-patterns"></emu-xref> is modified and extended as follows.</p>
<p>This alternative pattern grammar and semantics only changes the syntax and semantics of BMP patterns. The following grammar extensions include productions parameterized with the [UnicodeMode] parameter. However, none of these extensions change the syntax of Unicode patterns recognized when parsing with the [UnicodeMode] parameter present on the goal symbol.</p>
<h2>Syntax</h2>
<emu-grammar type="definition">
@@ -50345,13 +50350,13 @@ <h2>Syntax</h2>

ExtendedAtom[NamedCaptureGroups] ::
`.`
`\` AtomEscape[~UnicodeMode, ?NamedCaptureGroups]
`\` [lookahead == `c`]
`\` [lookahead &notin; { `b`, `B` }] AtomEscape[~UnicodeMode, ?NamedCaptureGroups]
`\` [lookahead == `c`] [lookahead != `c` AsciiLetter]
CharacterClass[~UnicodeMode, ~UnicodeSetsMode]
`(` GroupSpecifier[~UnicodeMode]? Disjunction[~UnicodeMode, ~UnicodeSetsMode, ?NamedCaptureGroups] `)`
`(?:` Disjunction[~UnicodeMode, ~UnicodeSetsMode, ?NamedCaptureGroups] `)`
InvalidBracedQuantifier
ExtendedPatternCharacter
[lookahead &notin; InvalidBracedQuantifier] ExtendedPatternCharacter

InvalidBracedQuantifier ::
`{` DecimalDigits[~Sep] `}`
@@ -50363,40 +50368,47 @@ <h2>Syntax</h2>

AtomEscape[UnicodeMode, NamedCaptureGroups] ::
[+UnicodeMode] DecimalEscape
[~UnicodeMode] DecimalEscape [> but only if the CapturingGroupNumber of |DecimalEscape| is &le; CountLeftCapturingParensWithin(the |Pattern| containing |DecimalEscape|)]
[~UnicodeMode] ConstrainedDecimalEscape
CharacterClassEscape[?UnicodeMode]
CharacterEscape[?UnicodeMode, ?NamedCaptureGroups]
[+UnicodeMode] CharacterEscape[?UnicodeMode, ?NamedCaptureGroups]
[~UnicodeMode] [lookahead &notin; ConstrainedDecimalEscape] CharacterEscape[?UnicodeMode, ?NamedCaptureGroups]
[+NamedCaptureGroups] `k` GroupName[?UnicodeMode]

ConstrainedDecimalEscape ::
DecimalEscape [> but only if the CapturingGroupNumber of |DecimalEscape| is &le; CountLeftCapturingParensWithin(the |Pattern| containing |DecimalEscape|)]

CharacterEscape[UnicodeMode, NamedCaptureGroups] ::
ControlEscape
`c` AsciiLetter
`0` [lookahead &notin; DecimalDigit]
HexEscapeSequence
RegExpUnicodeEscapeSequence[?UnicodeMode]
[~UnicodeMode] LegacyOctalEscapeSequence
IdentityEscape[?UnicodeMode, ?NamedCaptureGroups]
[lookahead &notin; HexEscapeSequence] [lookahead &notin; RegExpUnicodeEscapeSequence] IdentityEscape[?UnicodeMode, ?NamedCaptureGroups]

IdentityEscape[UnicodeMode, NamedCaptureGroups] ::
[+UnicodeMode] SyntaxCharacter
[+UnicodeMode] `/`
[~UnicodeMode] SourceCharacterIdentityEscape[?NamedCaptureGroups]

SourceCharacterIdentityEscape[NamedCaptureGroups] ::
[~NamedCaptureGroups] SourceCharacter but not `c`
[+NamedCaptureGroups] SourceCharacter but not one of `c` or `k`
[~NamedCaptureGroups] SourceCharacter but not one of `0` `1` `2` `3` `4` `5` `6` `7` `c` `f` `n` `r` `t` `v` `d` `s` `w` `D` `S` `W`
Member

Wouldn't something like this be better?

Suggested change
[~NamedCaptureGroups] SourceCharacter but not one of `0` `1` `2` `3` `4` `5` `6` `7` `c` `f` `n` `r` `t` `v` `d` `s` `w` `D` `S` `W`
[~NamedCaptureGroups] [lookahead &notin; OctalDigit] [lookahead &notin; ControlEscape] [lookahead &notin; CharacterClassEscape[?UnicodeMode]] SourceCharacter

It'd be less repetitive, more robust to change, and more self-explanatory.

Come to think of it, most of the "but not"s in the grammar can just use lookaheads instead.
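
Editorial illustration, not part of the PR: one way to check a rewrite like the suggested change is to compare the characters each form excludes. The sketch below assumes non-Unicode mode, where OctalDigit is `0` to `7`, ControlEscape is one of `f n r t v`, and CharacterClassEscape is one of `d D s S w W`. The `difference` helper and all variable names are hypothetical.

```typescript
// Terminal sets from the ECMAScript pattern grammar, non-Unicode mode only.
const octalDigit: string[] = "01234567".split("");
const controlEscape: string[] = "fnrtv".split("");
// \p{...} and \P{...} are only available with [+UnicodeMode], so they are omitted here.
const characterClassEscape: string[] = "dDsSwW".split("");

// Characters named by the explicit "but not one of" list on the original line.
const explicitList: string[] = "01234567cfnrtvdswDSW".split("");

// Characters excluded by the three lookaheads on the suggested line.
const suggested: string[] = octalDigit.concat(controlEscape, characterClassEscape);

// Hypothetical helper: members of `a` that are absent from `b`, sorted.
function difference(a: string[], b: string[]): string[] {
  return a.filter((ch) => b.indexOf(ch) === -1).sort();
}

const onlyInExplicit: string[] = difference(explicitList, suggested);
const onlyInSuggested: string[] = difference(suggested, explicitList);

console.log("excluded only by the explicit list:", onlyInExplicit);
console.log("excluded only by the suggestion:", onlyInSuggested);
```

Under these assumptions the only discrepancy is `c`, which the explicit list excludes and the three lookaheads do not.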

Collaborator Author

@jmdyck Aug 29, 2024

> Wouldn't something like this be better?

(Note that your suggestion is missing the c exclusion.)

> It'd be less repetitive, more robust to change, and more self-explanatory.

I didn't make this very clear, but at that point in the commit, I'm actually suggesting two different ways of expressing the right-hand sides of the SourceCharacterIdentityEscape production. You're commenting on a line from the first way, but the corresponding line from the second way is

[~NamedCaptureGroups] SourceCharacter but not one of OctalDigit or ControlEscape or CharacterClassEscape or `c`

which has all the benefits of your suggestion with even less repetition.

Mind you, in third-commit syntax, your suggestion could be reduced to

[~NamedCaptureGroups] [lookahead !~ OctalDigit | ControlEscape | CharacterClassEscape | `c`] SourceCharacter

which is on par with my second way.

So it comes down to a preference between "but not" vs "lookahead". Personally, I think it's a bit easier to get the general case and then the exceptions, rather than the other way round.

> Come to think of it, most of the "but not"s in the grammar can just use lookaheads instead.

I think they all could. The "but not" phrase goes back to ES1, so when ES3 introduced the "lookahead" phrase, I think they could have converted all the "but not"s to "lookahead"s. Maybe they didn't realize they could, maybe they wanted to minimize change, maybe they just preferred to leave the "but not"s as is, or maybe something else.
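
Editorial illustration, not part of the PR: for the single-character exclusions discussed in this thread, the "but not" form and the "lookahead" form can be modelled as two predicates that differ only in when the exclusion is tested. The sketch below is a rough model under stated assumptions: `butNot` and `lookaheadNotIn` are hypothetical names, SourceCharacter is simplified to one UTF-16 code unit, and the samples are ASCII only. It makes no claim about multi-character exclusions.

```typescript
// "SourceCharacter but not one of S": match one character, then reject it if it is in S.
function butNot(input: string, excluded: string[]): boolean {
  const matched: string = input.charAt(0); // simplified SourceCharacter (assumption)
  return matched !== "" && excluded.indexOf(matched) === -1;
}

// "[lookahead ∉ S] SourceCharacter": reject if the upcoming input begins with
// any member of S, then match one character.
function lookaheadNotIn(input: string, excluded: string[]): boolean {
  const blocked: boolean = excluded.some((seq) => input.indexOf(seq) === 0);
  return !blocked && input.charAt(0) !== "";
}

// Exclusions from the [+NamedCaptureGroups] alternative: `c` and `k`.
const excludedChars: string[] = ["c", "k"];
const samples: string[] = ["c", "k", "a", "0", "", "z9", "ck"];

// True when both formulations give the same answer on every sample.
const formsAgree: boolean = samples.every(
  (s) => butNot(s, excludedChars) === lookaheadNotIn(s, excludedChars)
);

console.log("forms agree on all samples:", formsAgree);
```

The two predicates agree here because every excluded sequence is exactly one character long, so testing the prefix before matching is the same as testing the matched character afterwards.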

[+NamedCaptureGroups] SourceCharacter but not one of `0` `1` `2` `3` `4` `5` `6` `7` `c` `f` `n` `r` `t` `v` `d` `s` `w` `D` `S` `W` `k`
`or`
[~NamedCaptureGroups] SourceCharacter but not one of OctalDigit or ControlEscape or CharacterClassEscape or `c`
[+NamedCaptureGroups] SourceCharacter but not one of OctalDigit or ControlEscape or CharacterClassEscape or `c` or `k`

ClassAtomNoDash[UnicodeMode, NamedCaptureGroups] ::
SourceCharacter but not one of `\` or `]` or `-`
`\` ClassEscape[?UnicodeMode, ?NamedCaptureGroups]
`\` [lookahead == `c`]
`\` [lookahead == `c`] [lookahead != `c` ClassControlLetter] [lookahead != `c` AsciiLetter]

ClassEscape[UnicodeMode, NamedCaptureGroups] ::
`b`
[+UnicodeMode] `-`
[~UnicodeMode] `c` ClassControlLetter
CharacterClassEscape[?UnicodeMode]
CharacterEscape[?UnicodeMode, ?NamedCaptureGroups]
[lookahead != `b`] CharacterEscape[?UnicodeMode, ?NamedCaptureGroups]

ClassControlLetter ::
DecimalDigit