From fcc09add690f6a972de77df9850bd2e13da3a6ff Mon Sep 17 00:00:00 2001
From: Michael Dyck
Date: Sun, 4 Aug 2019 17:39:27 -0400
Subject: [PATCH 01/12] Normative: Make B.1.3 "HTML-like comments" normative

(Part of Annex B reform, see PR #1595.)
---
 spec.html | 126 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 69 insertions(+), 57 deletions(-)

diff --git a/spec.html b/spec.html
index ea3cb4f174..21ad55a292 100644
--- a/spec.html
+++ b/spec.html
@@ -520,7 +520,7 @@

Context-Free Grammars

The Lexical and RegExp Grammars

A lexical grammar for ECMAScript is given in clause . This grammar has as its terminal symbols Unicode code points that conform to the rules for |SourceCharacter| defined in . It defines a set of productions, starting from the goal symbol |InputElementDiv|, |InputElementTemplateTail|, or |InputElementRegExp|, or |InputElementRegExpOrTemplateTail|, that describe how sequences of such code points are translated into a sequence of input elements.

-

Input elements other than white space and comments form the terminal symbols for the syntactic grammar for ECMAScript and are called ECMAScript tokens. These tokens are the reserved words, identifiers, literals, and punctuators of the ECMAScript language. Moreover, line terminators, although not considered to be tokens, also become part of the stream of input elements and guide the process of automatic semicolon insertion (). Simple white space and single-line comments are discarded and do not appear in the stream of input elements for the syntactic grammar. A |MultiLineComment| (that is, a comment of the form `/*`…`*/` regardless of whether it spans more than one line) is likewise simply discarded if it contains no line terminator; but if a |MultiLineComment| contains one or more line terminators, then it is replaced by a single line terminator, which becomes part of the stream of input elements for the syntactic grammar.

+

Input elements other than white space and comments form the terminal symbols for the syntactic grammar for ECMAScript and are called ECMAScript tokens. These tokens are the reserved words, identifiers, literals, and punctuators of the ECMAScript language. Moreover, line terminators, although not considered to be tokens, also become part of the stream of input elements and guide the process of automatic semicolon insertion (). Simple white space and single-line comments are discarded and do not appear in the stream of input elements for the syntactic grammar. A |MultiLineComment| (that is, a comment of the form `/*`…`*/` that spans more than one line) is replaced by a single line terminator, which becomes part of the stream of input elements for the syntactic grammar.

A RegExp grammar for ECMAScript is given in . This grammar also has as its terminal symbols the code points as defined by |SourceCharacter|. It defines a set of productions, starting from the goal symbol |Pattern|, that describe how sequences of code points are translated into regular expression patterns.

Productions of the lexical and RegExp grammars are distinguished by having two colons “::” as separating punctuation. The lexical and RegExp grammars share some productions.
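The effect of a comment that contains a line terminator on automatic semicolon insertion can be sketched in JavaScript. This is an illustrative sketch only: the variable names are hypothetical, and the results assume a conforming engine that compiles the strings with `new Function`.

```javascript
// Hypothetical names; behaviour assumes a conforming ECMAScript engine.

// This comment contains a line terminator, so the syntactic grammar sees a
// line terminator after `return`. Automatic semicolon insertion then ends
// the statement, and the function returns undefined.
const commentWithNewline = new Function("return /* first line\nsecond line */ 1;");

// This comment stays on one line, so it is discarded like white space and
// the function returns 1.
const commentOnOneLine = new Function("return /* one line */ 1;");

const resultWithNewline = commentWithNewline();
const resultOnOneLine = commentOnOneLine();
console.log(resultWithNewline, resultOnOneLine);
```

The two function bodies differ only in whether the comment spans a line break, which is the distinction the paragraph above draws.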

@@ -16018,7 +16018,7 @@

Syntax

Line Terminators

Like white space code points, line terminator code points are used to improve source text readability and to separate tokens (indivisible lexical units) from each other. However, unlike white space code points, line terminators have some influence over the behaviour of the syntactic grammar. In general, line terminators may occur between any two tokens, but there are a few places where they are forbidden by the syntactic grammar. Line terminators also affect the process of automatic semicolon insertion (). A line terminator cannot occur within any token except a |StringLiteral|, |Template|, or |TemplateSubstitutionTail|. <LF> and <CR> line terminators cannot occur within a |StringLiteral| token except as part of a |LineContinuation|.

-

A line terminator can occur within a |MultiLineComment| but cannot occur within a |SingleLineComment|.

+

A line terminator must occur within a |MultiLineComment| but cannot occur within a |SingleLineDelimitedComment| or a |SingleLineComment|.

Line terminators are included in the set of white space code points that are matched by the `\\s` class in regular expressions.

The ECMAScript line terminator code points are listed in .

@@ -16104,15 +16104,21 @@

Syntax

Comments

Comments can be either single or multi-line. Multi-line comments cannot nest.

Because a single-line comment can contain any Unicode code point except a |LineTerminator| code point, and because of the general rule that a token is always as long as possible, a single-line comment always consists of all code points from the `//` marker to the end of the line. However, the |LineTerminator| at the end of the line is not considered to be part of the single-line comment; it is recognized separately by the lexical grammar and becomes part of the stream of input elements for the syntactic grammar. This point is very important, because it implies that the presence or absence of single-line comments does not affect the process of automatic semicolon insertion (see ).

-

Comments behave like white space and are discarded except that, if a |MultiLineComment| contains a line terminator code point, then the entire comment is considered to be a |LineTerminator| for purposes of parsing by the syntactic grammar.

+

Comments behave like white space and are discarded except that a |MultiLineComment| or a |SingleLineHTMLCloseComment| is considered to be a |LineTerminator| for purposes of parsing by the syntactic grammar.

Syntax

 Comment ::
   MultiLineComment
   SingleLineComment
+  SingleLineHTMLOpenComment
+  SingleLineHTMLCloseComment
+  SingleLineDelimitedComment

 MultiLineComment ::
-  `/*` MultiLineCommentChars? `*/`
+  `/*` FirstCommentLine? LineTerminator MultiLineCommentChars? `*/` HTMLCloseComment?
+
+FirstCommentLine ::
+  SingleLineDelimitedCommentChars

 MultiLineCommentChars ::
   MultiLineNotAsteriskChar MultiLineCommentChars?
@@ -16131,13 +16137,59 @@

Syntax

 SingleLineComment ::
   `//` SingleLineCommentChars?

+SingleLineHTMLOpenComment ::
+  `<!--` SingleLineCommentChars?
+
+SingleLineHTMLCloseComment ::
+  LineTerminatorSequence HTMLCloseComment
+
+HTMLCloseComment ::
+  WhiteSpaceSequence? SingleLineDelimitedCommentSequence? `-->` SingleLineCommentChars?
+
+SingleLineDelimitedCommentSequence ::
+  SingleLineDelimitedComment WhiteSpaceSequence? SingleLineDelimitedCommentSequence?
+
+WhiteSpaceSequence ::
+  WhiteSpace WhiteSpaceSequence?
+
 SingleLineCommentChars ::
   SingleLineCommentChar SingleLineCommentChars?

 SingleLineCommentChar ::
   SourceCharacter but not LineTerminator
+
+SingleLineDelimitedComment ::
+  `/*` SingleLineDelimitedCommentChars? `*/`
+
+SingleLineDelimitedCommentChars ::
+  SingleLineNotAsteriskChar SingleLineDelimitedCommentChars?
+  `*` SingleLinePostAsteriskCommentChars?
+
+SingleLineNotAsteriskChar ::
+  SourceCharacter but not one of `*` or LineTerminator
+
+SingleLinePostAsteriskCommentChars ::
+  SingleLineNotForwardSlashOrAsteriskChar SingleLineDelimitedCommentChars?
+  `*` SingleLinePostAsteriskCommentChars?
+
+SingleLineNotForwardSlashOrAsteriskChar ::
+  SourceCharacter but not one of `/` or `*` or LineTerminator
-

A number of productions in this section are given alternative definitions in section

+ + +

Static Semantics: Early Errors

+ + SingleLineHTMLOpenComment :: + `<!--` SingleLineCommentChars? + + HTMLCloseComment :: + WhiteSpaceSequence? SingleLineDelimitedCommentSequence? `-->` SingleLineCommentChars? + +
    +
  • It is a Syntax Error if a |Module| contains the source code matching this production.
  • +
+ In a |Script|, this syntax is allowed, but deprecated. +
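The HTML-like comment productions added in this hunk can be exercised with a small JavaScript sketch. It is only an illustration: the variable names are hypothetical, and it assumes an engine that implements HTML-like comments and relies on indirect `eval` parsing its argument with the |Script| goal symbol.

```javascript
// Hypothetical names. Indirect eval parses its argument as a Script,
// where HTML-like comments are permitted (assuming the engine supports them).
const indirectEval = eval;

// `<!--` begins a SingleLineHTMLOpenComment that runs to the end of the line,
// so only the expression `1` is evaluated.
const openResult = indirectEval("1 <!-- ignored + 41\n");

// `-->` at the start of a line (after optional white space) begins a
// SingleLineHTMLCloseComment, so only the expression `2` is evaluated.
const closeResult = indirectEval("2\n  --> ignored + 40\n");

console.log(openResult, closeResult);
```

The same source text parsed as a |Module| would be rejected under the early error rule above. The sketch does not demonstrate that case.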
@@ -28298,9 +28350,6 @@

Forbidden Extensions

  • When processing strict mode code, the extensions defined in , , , and must not be supported.
  • -
  • - When parsing for the |Module| goal symbol, the lexical grammar extensions defined in must not be supported. -
  • |ImportCall| must not be extended. @@ -46088,13 +46137,24 @@

    Lexical Grammar

    + + + + + + + + + + + @@ -46506,55 +46566,7 @@

    Additional Syntax

    HTML-like Comments

    -

    The syntax and semantics of is extended as follows except that this extension is not allowed when parsing source code using the goal symbol |Module|:

    -

    Syntax

-
-Comment ::
-  MultiLineComment
-  SingleLineComment
-  SingleLineHTMLOpenComment
-  SingleLineHTMLCloseComment
-  SingleLineDelimitedComment
-
-MultiLineComment ::
-  `/*` FirstCommentLine? LineTerminator MultiLineCommentChars? `*/` HTMLCloseComment?
-
-FirstCommentLine ::
-  SingleLineDelimitedCommentChars
-
-SingleLineHTMLOpenComment ::
-  `<!--` SingleLineCommentChars?
-
-SingleLineHTMLCloseComment ::
-  LineTerminatorSequence HTMLCloseComment
-
-SingleLineDelimitedComment ::
-  `/*` SingleLineDelimitedCommentChars? `*/`
-
-HTMLCloseComment ::
-  WhiteSpaceSequence? SingleLineDelimitedCommentSequence? `-->` SingleLineCommentChars?
-
-SingleLineDelimitedCommentChars ::
-  SingleLineNotAsteriskChar SingleLineDelimitedCommentChars?
-  `*` SingleLinePostAsteriskCommentChars?
-
-SingleLineNotAsteriskChar ::
-  SourceCharacter but not one of `*` or LineTerminator
-
-SingleLinePostAsteriskCommentChars ::
-  SingleLineNotForwardSlashOrAsteriskChar SingleLineDelimitedCommentChars?
-  `*` SingleLinePostAsteriskCommentChars?
-
-SingleLineNotForwardSlashOrAsteriskChar ::
-  SourceCharacter but not one of `/` or `*` or LineTerminator
-
-WhiteSpaceSequence ::
-  WhiteSpace WhiteSpaceSequence?
-
-SingleLineDelimitedCommentSequence ::
-  SingleLineDelimitedComment WhiteSpaceSequence? SingleLineDelimitedCommentSequence?
-
-

    Similar to a |MultiLineComment| that contains a line terminator code point, a |SingleLineHTMLCloseComment| is considered to be a |LineTerminator| for purposes of parsing by the syntactic grammar.

    +

    The HTML-like comment syntax used to be normative optional outside |Module|s.

From db18fce72d0433548c84583aff50a3b620ebc6e9 Mon Sep 17 00:00:00 2001
From: Michael Dyck
Date: Fri, 16 Aug 2019 16:12:24 -0400
Subject: [PATCH 02/12] Editorial: Split 'Patterns' clause into two clauses

... namely Syntax for Patterns and Static Semantics for Patterns

And for consistency, rename Pattern Semantics to Runtime Semantics for Patterns

... so that clause-headers more precisely convey what content is where.
---
 spec.html | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/spec.html b/spec.html
index 21ad55a292..3a75672908 100644
--- a/spec.html
+++ b/spec.html
@@ -34241,8 +34241,8 @@

    RegExp (Regular Expression) Objects

    -

    Patterns

    -

    The RegExp constructor applies the following grammar to the input pattern String. An error occurs if the grammar cannot interpret the String as an expansion of |Pattern|.

    +

    Syntax for Patterns

    +

    The `RegExp` constructor applies the following grammar to the input pattern String. An error occurs if the grammar cannot interpret the String as an expansion of |Pattern|.

    Syntax

    Pattern[UnicodeMode, N] :: @@ -34441,6 +34441,10 @@

    Syntax

    CharacterClassEscape[?UnicodeMode] CharacterEscape[?UnicodeMode]
    +
    + + +

    Static Semantics for Patterns

    A number of productions in this section are given alternative definitions in section .

    @@ -34856,10 +34860,7 @@

    Static Semantics: RegExpIdentifierCodePoint

    -

    Pattern Semantics

    - -

    This section is amended in .

    -
    +

    Runtime Semantics for Patterns

    A regular expression pattern is converted into an Abstract Closure using the process described below. An implementation is encouraged to use more efficient algorithms than the ones listed below, as long as the results are the same. The Abstract Closure is used as the value of a RegExp object's [[RegExpMatcher]] internal slot.

    A |Pattern| is either a BMP pattern or a Unicode pattern depending upon whether or not its associated flags contain a `u`. A BMP pattern matches against a String interpreted as consisting of a sequence of 16-bit values that are Unicode code points in the range of the Basic Multilingual Plane. A Unicode pattern matches against a String interpreted as consisting of Unicode code points encoded using UTF-16. In the context of describing the behaviour of a BMP pattern “character” means a single 16-bit Unicode BMP code point. In the context of describing the behaviour of a Unicode pattern “character” means a UTF-16 encoded code point (). In either context, “character value” means the numeric value of the corresponding non-encoded code point.

    The syntax and semantics of |Pattern| is defined as if the source code for the |Pattern| was a List of |SourceCharacter| values where each |SourceCharacter| corresponds to a Unicode code point. If a BMP pattern contains a non-BMP |SourceCharacter| the entire pattern is encoded using UTF-16 and the individual code units of that encoding are used as the elements of the List.
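The difference between a BMP pattern and a Unicode pattern described above can be sketched in JavaScript. The variable names are hypothetical and chosen only for illustration.

```javascript
// Hypothetical names. U+1F600 is one code point but two UTF-16 code units.
const astral = "\u{1F600}";

// BMP pattern (no `u` flag): each "character" is one 16-bit code unit,
// so `.` cannot consume the whole string in a single step.
const bmpOneChar = /^.$/.test(astral);

// The same BMP pattern with two `.` atoms matches the two code units.
const bmpTwoChars = /^..$/.test(astral);

// Unicode pattern (`u` flag): each "character" is a whole code point.
const unicodeOneChar = /^.$/u.test(astral);

console.log(bmpOneChar, bmpTwoChars, unicodeOneChar);
```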

From 2100978f92b27b1d58689527038722c06105cf23 Mon Sep 17 00:00:00 2001
From: Michael Dyck
Date: Fri, 16 Aug 2019 16:13:28 -0400
Subject: [PATCH 03/12] Markup: change clause-ids to reflect changes in clause-titles

... in previous commit.
---
 spec.html | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/spec.html b/spec.html
index 3a75672908..2abf9c75ba 100644
--- a/spec.html
+++ b/spec.html
@@ -521,7 +521,7 @@

    Context-Free Grammars

    The Lexical and RegExp Grammars

    A lexical grammar for ECMAScript is given in clause . This grammar has as its terminal symbols Unicode code points that conform to the rules for |SourceCharacter| defined in . It defines a set of productions, starting from the goal symbol |InputElementDiv|, |InputElementTemplateTail|, or |InputElementRegExp|, or |InputElementRegExpOrTemplateTail|, that describe how sequences of such code points are translated into a sequence of input elements.

    Input elements other than white space and comments form the terminal symbols for the syntactic grammar for ECMAScript and are called ECMAScript tokens. These tokens are the reserved words, identifiers, literals, and punctuators of the ECMAScript language. Moreover, line terminators, although not considered to be tokens, also become part of the stream of input elements and guide the process of automatic semicolon insertion (). Simple white space and single-line comments are discarded and do not appear in the stream of input elements for the syntactic grammar. A |MultiLineComment| (that is, a comment of the form `/*`…`*/` that spans more than one line) is replaced by a single line terminator, which becomes part of the stream of input elements for the syntactic grammar.

    -

    A RegExp grammar for ECMAScript is given in . This grammar also has as its terminal symbols the code points as defined by |SourceCharacter|. It defines a set of productions, starting from the goal symbol |Pattern|, that describe how sequences of code points are translated into regular expression patterns.

    +

    A RegExp grammar for ECMAScript is given in . This grammar also has as its terminal symbols the code points as defined by |SourceCharacter|. It defines a set of productions, starting from the goal symbol |Pattern|, that describe how sequences of code points are translated into regular expression patterns.

    Productions of the lexical and RegExp grammars are distinguished by having two colons “::” as separating punctuation. The lexical and RegExp grammars share some productions.

    @@ -17111,8 +17111,8 @@

    Regular Expression Literals

    A regular expression literal is an input element that is converted to a RegExp object (see ) each time the literal is evaluated. Two regular expression literals in a program evaluate to regular expression objects that never compare as `===` to each other even if the two literals' contents are identical. A RegExp object may also be created at runtime by `new RegExp` or calling the RegExp constructor as a function (see ).
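The identity rule for regular expression literals stated above can be illustrated with a short JavaScript sketch. The function and variable names are hypothetical.

```javascript
// Hypothetical names. Each evaluation of the literal creates a new RegExp object.
function makePattern() {
  return /ab+c/g;
}

const first = makePattern();
const second = makePattern();

// Distinct objects, even though the literal text is identical.
const sameObject = first === second;

// The contents (source text and flags) are nevertheless equal.
const sameContents = first.source === second.source && first.flags === second.flags;

// RegExp objects can also be created at runtime, with or without `new`.
const viaNew = new RegExp("ab+c", "g");
const viaCall = RegExp("ab+c", "g");

console.log(sameObject, sameContents, viaNew.source, viaCall.source);
```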

    -

    The productions below describe the syntax for a regular expression literal and are used by the input element scanner to find the end of the regular expression literal. The source text comprising the |RegularExpressionBody| and the |RegularExpressionFlags| are subsequently parsed again using the more stringent ECMAScript Regular Expression grammar ().

    -

    An implementation may extend the ECMAScript Regular Expression grammar defined in , but it must not extend the |RegularExpressionBody| and |RegularExpressionFlags| productions defined below or the productions used by these productions.

    +

    The productions below describe the syntax for a regular expression literal and are used by the input element scanner to find the end of the regular expression literal. The source text comprising the |RegularExpressionBody| and the |RegularExpressionFlags| are subsequently parsed again using the more stringent ECMAScript Regular Expression grammar ().

    +

    An implementation may extend the ECMAScript Regular Expression grammar defined in , but it must not extend the |RegularExpressionBody| and |RegularExpressionFlags| productions defined below or the productions used by these productions.

    Syntax

    RegularExpressionLiteral :: @@ -28336,7 +28336,7 @@

    Forbidden Extensions

    The behaviour of built-in methods which are specified in ECMA-402, such as those named `toLocaleString`, must not be extended except as specified in ECMA-402.
  • - The RegExp pattern grammars in and must not be extended to recognize any of the source characters A-Z or a-z as |IdentityEscape[+UnicodeMode]| when the [UnicodeMode] grammar parameter is present. + The RegExp pattern grammars in and must not be extended to recognize any of the source characters A-Z or a-z as |IdentityEscape[+UnicodeMode]| when the [UnicodeMode] grammar parameter is present.
  • The Syntactic Grammar must not be extended in any manner that allows the token `:` to immediately follow source text that matches the |BindingIdentifier| nonterminal symbol. @@ -34240,7 +34240,7 @@

    RegExp (Regular Expression) Objects

    The form and functionality of regular expressions is modelled after the regular expression facility in the Perl 5 programming language.

    - +

    Syntax for Patterns

    The `RegExp` constructor applies the following grammar to the input pattern String. An error occurs if the grammar cannot interpret the String as an expansion of |Pattern|.

    Syntax

    @@ -34859,7 +34859,7 @@

    Static Semantics: RegExpIdentifierCodePoint

    - +

    Runtime Semantics for Patterns

    A regular expression pattern is converted into an Abstract Closure using the process described below. An implementation is encouraged to use more efficient algorithms than the ones listed below, as long as the results are the same. The Abstract Closure is used as the value of a RegExp object's [[RegExpMatcher]] internal slot.

    A |Pattern| is either a BMP pattern or a Unicode pattern depending upon whether or not its associated flags contain a `u`. A BMP pattern matches against a String interpreted as consisting of a sequence of 16-bit values that are Unicode code points in the range of the Basic Multilingual Plane. A Unicode pattern matches against a String interpreted as consisting of Unicode code points encoded using UTF-16. In the context of describing the behaviour of a BMP pattern “character” means a single 16-bit Unicode BMP code point. In the context of describing the behaviour of a Unicode pattern “character” means a UTF-16 encoded code point (). In either context, “character value” means the numeric value of the corresponding non-encoded code point.

    @@ -34927,8 +34927,8 @@

    Pattern

    1. Return a new Abstract Closure with parameters (_str_, _index_) that captures _m_ and performs the following steps when called: 1. Assert: Type(_str_) is String. 1. Assert: _index_ is a non-negative integer which is ≤ the length of _str_. - 1. If _Unicode_ is *true*, let _Input_ be ! StringToCodePoints(_str_). Otherwise, let _Input_ be a List whose elements are the code units that are the elements of _str_. _Input_ will be used throughout the algorithms in . Each element of _Input_ is considered to be a character. - 1. Let _InputLength_ be the number of characters contained in _Input_. This alias will be used throughout the algorithms in . + 1. If _Unicode_ is *true*, let _Input_ be ! StringToCodePoints(_str_). Otherwise, let _Input_ be a List whose elements are the code units that are the elements of _str_. _Input_ will be used throughout the algorithms in . Each element of _Input_ is considered to be a character. + 1. Let _InputLength_ be the number of characters contained in _Input_. This alias will be used throughout the algorithms in . 1. Let _listIndex_ be the index into _Input_ of the character that was obtained from element _index_ of _str_. 1. Let _c_ be a new Continuation with parameters (_y_) that captures nothing and performs the following steps when called: 1. Assert: _y_ is a State. @@ -34938,7 +34938,7 @@

    Pattern

    1. Return _m_(_x_, _c_). -

    A Pattern evaluates (“compiles”) to an Abstract Closure value. RegExpBuiltinExec can then apply this procedure to a String and an offset within the String to determine whether the pattern would match starting at exactly that offset within the String, and, if it does match, what the values of the capturing parentheses would be. The algorithms in are designed so that compiling a pattern may throw a *SyntaxError* exception; on the other hand, once the pattern is successfully compiled, applying the resulting Abstract Closure to find a match in a String cannot throw an exception (except for any implementation-defined exceptions that can occur anywhere such as out-of-memory).

    +

    A Pattern evaluates (“compiles”) to an Abstract Closure value. RegExpBuiltinExec can then apply this procedure to a String and an offset within the String to determine whether the pattern would match starting at exactly that offset within the String, and, if it does match, what the values of the capturing parentheses would be. The algorithms in are designed so that compiling a pattern may throw a *SyntaxError* exception; on the other hand, once the pattern is successfully compiled, applying the resulting Abstract Closure to find a match in a String cannot throw an exception (except for any implementation-defined exceptions that can occur anywhere such as out-of-memory).
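The split between compile-time errors and exception-free matching described above can be observed from JavaScript. The sketch below uses a hypothetical helper, `tryCompile`, that is not part of the specification.

```javascript
// Hypothetical helper: reports whether compiling a pattern succeeds.
function tryCompile(source, flags) {
  try {
    return { ok: true, regexp: new RegExp(source, flags) };
  } catch (error) {
    return { ok: false, error };
  }
}

// An unterminated group cannot be parsed as a Pattern, so compiling throws.
const bad = tryCompile("(", "");

// A valid pattern compiles. Applying it afterwards yields either a match
// result or null; it does not throw for ordinary input.
const good = tryCompile("(a)(b)?", "");
const match = good.regexp.exec("a");
const noMatch = good.regexp.exec("zzz");

console.log(bad.ok, bad.error instanceof SyntaxError, match[1], match[2], noMatch);
```

In the successful match, the second capture is `undefined` because the optional group did not participate. This is the "values of the capturing parentheses" the paragraph refers to.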

    @@ -35845,7 +35845,7 @@

    1. Assert: _parseResult_ is a |Pattern| Parse Node. 1. Set _obj_.[[OriginalSource]] to _P_. 1. Set _obj_.[[OriginalFlags]] to _F_. - 1. Set _obj_.[[RegExpMatcher]] to the Abstract Closure that evaluates _parseResult_ by applying the semantics provided in using _patternCharacters_ as the pattern's List of |SourceCharacter| values and _F_ as the flag parameters. + 1. Set _obj_.[[RegExpMatcher]] to the Abstract Closure that evaluates _parseResult_ by applying the semantics provided in using _patternCharacters_ as the pattern's List of |SourceCharacter| values and _F_ as the flag parameters. 1. Perform ? Set(_obj_, *"lastIndex"*, *+0*𝔽, *true*). 1. Return _obj_. @@ -46572,7 +46572,7 @@

    HTML-like Comments

    Regular Expressions Patterns

    -

    The syntax of is modified and extended as follows. These changes introduce ambiguities that are broken by the ordering of grammar productions and by contextual information. When parsing using the following grammar, each alternative is considered only if previous production alternatives do not match.

    +

    The syntax of is modified and extended as follows. These changes introduce ambiguities that are broken by the ordering of grammar productions and by contextual information. When parsing using the following grammar, each alternative is considered only if previous production alternatives do not match.

    This alternative pattern grammar and semantics only changes the syntax and semantics of BMP patterns. The following grammar extensions include productions parameterized with the [UnicodeMode] parameter. However, none of these extensions change the syntax of Unicode patterns recognized when parsing with the [UnicodeMode] parameter present on the goal symbol.

    Syntax

    @@ -46727,7 +46727,7 @@

    Static Semantics: CharacterValue

    Pattern Semantics

    -

    The semantics of is extended as follows:

    +

    The semantics of is extended as follows:

    Within reference to “Atom :: `(` GroupSpecifier Disjunction `)` ” are to be interpreted as meaning “Atom :: `(` GroupSpecifier Disjunction `)` ” or “ExtendedAtom :: `(` Disjunction `)` ”.

    Term () includes the following additional evaluation rules:

From 5d7bfbfce666ad549f782b4e6d49757fa9c8f799 Mon Sep 17 00:00:00 2001
From: Michael Dyck
Date: Fri, 16 Aug 2019 23:15:10 -0400
Subject: [PATCH 04/12] Editorial: Rearrange "Syntax for Patterns" into 4 subsections

... namely:
- Patterns
- Group Specifiers
- Character Classes
- Escapes

(This moves productions around, but doesn't alter them at all.)

Also, rearrange Early Errors rules and runtime semantics rules
to reflect the same order of productions.
---
 spec.html | 511 +++++++++++++++++++++++++++---------------------------
 1 file changed, 259 insertions(+), 252 deletions(-)

diff --git a/spec.html b/spec.html
index 2abf9c75ba..57734d77ca 100644
--- a/spec.html
+++ b/spec.html
@@ -34243,7 +34243,7 @@

    RegExp (Regular Expression) Objects

    Syntax for Patterns

    The `RegExp` constructor applies the following grammar to the input pattern String. An error occurs if the grammar cannot interpret the String as an expansion of |Pattern|.

    -

    Syntax

    +

    Patterns

    Pattern[UnicodeMode, N] :: Disjunction[?UnicodeMode, ?N] @@ -34291,33 +34291,15 @@

    Syntax

    `(` GroupSpecifier[?UnicodeMode] Disjunction[?UnicodeMode, ?N] `)` `(` `?` `:` Disjunction[?UnicodeMode, ?N] `)` - SyntaxCharacter :: one of - `^` `$` `\` `.` `*` `+` `?` `(` `)` `[` `]` `{` `}` `|` - PatternCharacter :: SourceCharacter but not SyntaxCharacter - AtomEscape[UnicodeMode, N] :: - DecimalEscape - CharacterClassEscape[?UnicodeMode] - CharacterEscape[?UnicodeMode] - [+N] `k` GroupName[?UnicodeMode] - - CharacterEscape[UnicodeMode] :: - ControlEscape - `c` ControlLetter - `0` [lookahead ∉ DecimalDigit] - HexEscapeSequence - RegExpUnicodeEscapeSequence[?UnicodeMode] - IdentityEscape[?UnicodeMode] - - ControlEscape :: one of - `f` `n` `r` `t` `v` - - ControlLetter :: one of - `a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m` `n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z` - `A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M` `N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z` + SyntaxCharacter :: one of + `^` `$` `\` `.` `*` `+` `?` `(` `)` `[` `]` `{` `}` `|` +
    +

    Group Specifiers

    + GroupSpecifier[UnicodeMode] :: [empty] `?` GroupName[?UnicodeMode] @@ -34339,35 +34321,55 @@

    Syntax

    `\` RegExpUnicodeEscapeSequence[+UnicodeMode] [~UnicodeMode] UnicodeLeadSurrogate UnicodeTrailSurrogate - RegExpUnicodeEscapeSequence[UnicodeMode] :: - [+UnicodeMode] `u` HexLeadSurrogate `\u` HexTrailSurrogate - [+UnicodeMode] `u` HexLeadSurrogate - [+UnicodeMode] `u` HexTrailSurrogate - [+UnicodeMode] `u` HexNonSurrogate - [~UnicodeMode] `u` Hex4Digits - [+UnicodeMode] `u{` CodePoint `}` - UnicodeLeadSurrogate :: > any Unicode code point in the inclusive range 0xD800 to 0xDBFF UnicodeTrailSurrogate :: > any Unicode code point in the inclusive range 0xDC00 to 0xDFFF
    -

    Each `\\u` |HexTrailSurrogate| for which the choice of associated `u` |HexLeadSurrogate| is ambiguous shall be associated with the nearest possible `u` |HexLeadSurrogate| that would otherwise have no corresponding `\\u` |HexTrailSurrogate|.

    + +

    Character Classes

    - HexLeadSurrogate :: - Hex4Digits [> but only if the MV of |Hex4Digits| is in the inclusive range 0xD800 to 0xDBFF] + CharacterClass[UnicodeMode] :: + `[` [lookahead != `^`] ClassRanges[?UnicodeMode] `]` + `[` `^` ClassRanges[?UnicodeMode] `]` - HexTrailSurrogate :: - Hex4Digits [> but only if the MV of |Hex4Digits| is in the inclusive range 0xDC00 to 0xDFFF] + ClassRanges[UnicodeMode] :: + [empty] + NonemptyClassRanges[?UnicodeMode] - HexNonSurrogate :: - Hex4Digits [> but only if the MV of |Hex4Digits| is not in the inclusive range 0xD800 to 0xDFFF] + NonemptyClassRanges[UnicodeMode] :: + ClassAtom[?UnicodeMode] + ClassAtom[?UnicodeMode] NonemptyClassRangesNoDash[?UnicodeMode] + ClassAtom[?UnicodeMode] `-` ClassAtom[?UnicodeMode] ClassRanges[?UnicodeMode] - IdentityEscape[UnicodeMode] :: - [+UnicodeMode] SyntaxCharacter - [+UnicodeMode] `/` - [~UnicodeMode] SourceCharacter but not UnicodeIDContinue + NonemptyClassRangesNoDash[UnicodeMode] :: + ClassAtom[?UnicodeMode] + ClassAtomNoDash[?UnicodeMode] NonemptyClassRangesNoDash[?UnicodeMode] + ClassAtomNoDash[?UnicodeMode] `-` ClassAtom[?UnicodeMode] ClassRanges[?UnicodeMode] + + ClassAtom[UnicodeMode] :: + `-` + ClassAtomNoDash[?UnicodeMode] + + ClassAtomNoDash[UnicodeMode] :: + SourceCharacter but not one of `\` or `]` or `-` + `\` ClassEscape[?UnicodeMode] + + +

    Escapes

    + + ClassEscape[UnicodeMode] :: + `b` + [+UnicodeMode] `-` + CharacterClassEscape[?UnicodeMode] + CharacterEscape[?UnicodeMode] + + AtomEscape[UnicodeMode, N] :: + DecimalEscape + CharacterClassEscape[?UnicodeMode] + CharacterEscape[?UnicodeMode] + [+N] `k` GroupName[?UnicodeMode] DecimalEscape :: NonZeroDigit DecimalDigits[~Sep]? [lookahead ∉ DecimalDigit] @@ -34409,37 +34411,44 @@

    Syntax

    ControlLetter `_` - CharacterClass[UnicodeMode] :: - `[` [lookahead != `^`] ClassRanges[?UnicodeMode] `]` - `[` `^` ClassRanges[?UnicodeMode] `]` + CharacterEscape[UnicodeMode] :: + ControlEscape + `c` ControlLetter + `0` [lookahead ∉ DecimalDigit] + HexEscapeSequence + RegExpUnicodeEscapeSequence[?UnicodeMode] + IdentityEscape[?UnicodeMode] - ClassRanges[UnicodeMode] :: - [empty] - NonemptyClassRanges[?UnicodeMode] + ControlEscape :: one of + `f` `n` `r` `t` `v` - NonemptyClassRanges[UnicodeMode] :: - ClassAtom[?UnicodeMode] - ClassAtom[?UnicodeMode] NonemptyClassRangesNoDash[?UnicodeMode] - ClassAtom[?UnicodeMode] `-` ClassAtom[?UnicodeMode] ClassRanges[?UnicodeMode] + ControlLetter :: one of + `a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m` `n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z` + `A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M` `N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z` - NonemptyClassRangesNoDash[UnicodeMode] :: - ClassAtom[?UnicodeMode] - ClassAtomNoDash[?UnicodeMode] NonemptyClassRangesNoDash[?UnicodeMode] - ClassAtomNoDash[?UnicodeMode] `-` ClassAtom[?UnicodeMode] ClassRanges[?UnicodeMode] + RegExpUnicodeEscapeSequence[UnicodeMode] :: + [+UnicodeMode] `u` HexLeadSurrogate `\u` HexTrailSurrogate + [+UnicodeMode] `u` HexLeadSurrogate + [+UnicodeMode] `u` HexTrailSurrogate + [+UnicodeMode] `u` HexNonSurrogate + [~UnicodeMode] `u` Hex4Digits + [+UnicodeMode] `u{` CodePoint `}` +
    +

    Each `\\u` |HexTrailSurrogate| for which the choice of associated `u` |HexLeadSurrogate| is ambiguous shall be associated with the nearest possible `u` |HexLeadSurrogate| that would otherwise have no corresponding `\\u` |HexTrailSurrogate|.

    + + HexLeadSurrogate :: + Hex4Digits [> but only if the MV of |Hex4Digits| is in the inclusive range 0xD800 to 0xDBFF] - ClassAtom[UnicodeMode] :: - `-` - ClassAtomNoDash[?UnicodeMode] + HexTrailSurrogate :: + Hex4Digits [> but only if the MV of |Hex4Digits| is in the inclusive range 0xDC00 to 0xDFFF] - ClassAtomNoDash[UnicodeMode] :: - SourceCharacter but not one of `\` or `]` or `-` - `\` ClassEscape[?UnicodeMode] + HexNonSurrogate :: + Hex4Digits [> but only if the MV of |Hex4Digits| is not in the inclusive range 0xD800 to 0xDFFF] - ClassEscape[UnicodeMode] :: - `b` - [+UnicodeMode] `-` - CharacterClassEscape[?UnicodeMode] - CharacterEscape[?UnicodeMode] + IdentityEscape[UnicodeMode] :: + [+UnicodeMode] SyntaxCharacter + [+UnicodeMode] `/` + [~UnicodeMode] SourceCharacter but not UnicodeIDContinue
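The escape productions relocated in this hunk can be probed from JavaScript. The sketch below is illustrative only: `compiles` is a hypothetical helper, and the results assume an engine that follows the `[+UnicodeMode]` alternatives shown above.

```javascript
// Hypothetical helper: true if the pattern compiles, false on SyntaxError.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// IdentityEscape[+UnicodeMode] allows a SyntaxCharacter or `/`.
const escapedDollar = compiles("\\$", "u");
const escapedSlash = compiles("\\/", "u");

// `-` is not a SyntaxCharacter, so `\-` is rejected outside a class in
// Unicode mode, but ClassEscape[+UnicodeMode] allows it inside a class.
const dashOutsideClass = compiles("\\-", "u");
const dashInsideClass = compiles("[\\-]", "u");

// In Unicode mode, `\uD83D\uDE00` is a lead/trail surrogate pair and
// denotes the single code point U+1F600.
const surrogatePair = new RegExp("^\\uD83D\\uDE00$", "u").test("\u{1F600}");

console.log(escapedDollar, escapedSlash, dashOutsideClass, dashInsideClass, surrogatePair);
```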
    @@ -34470,58 +34479,58 @@

    Static Semantics: Early Errors

    It is a Syntax Error if the MV of the first |DecimalDigits| is larger than the MV of the second |DecimalDigits|.
  • - AtomEscape :: `k` GroupName + RegExpIdentifierStart :: `\` RegExpUnicodeEscapeSequence - AtomEscape :: DecimalEscape + RegExpIdentifierPart :: `\` RegExpUnicodeEscapeSequence - NonemptyClassRanges :: ClassAtom `-` ClassAtom ClassRanges + RegExpIdentifierStart :: UnicodeLeadSurrogate UnicodeTrailSurrogate + RegExpIdentifierPart :: UnicodeLeadSurrogate UnicodeTrailSurrogate + - NonemptyClassRangesNoDash :: ClassAtomNoDash `-` ClassAtom ClassRanges + NonemptyClassRanges :: ClassAtom `-` ClassAtom ClassRanges - RegExpIdentifierStart :: `\` RegExpUnicodeEscapeSequence + NonemptyClassRangesNoDash :: ClassAtomNoDash `-` ClassAtom ClassRanges - RegExpIdentifierStart :: UnicodeLeadSurrogate UnicodeTrailSurrogate - - RegExpIdentifierPart :: `\` RegExpUnicodeEscapeSequence + AtomEscape :: DecimalEscape - RegExpIdentifierPart :: UnicodeLeadSurrogate UnicodeTrailSurrogate + AtomEscape :: `k` GroupName UnicodePropertyValueExpression :: UnicodePropertyName `=` UnicodePropertyValue @@ -34773,7 +34782,7 @@

    Static Semantics: CharacterValue

    1. Return the MV of |HexDigits|. - CharacterEscape :: IdentityEscape + CharacterEscape ::! IdentityEscape 1. Let _ch_ be the code point matched by |IdentityEscape|. 1. Return the code point value of _ch_. @@ -35464,150 +35473,6 @@

    - -

    AtomEscape

    -

    With parameter _direction_.

    -

    The production AtomEscape :: DecimalEscape evaluates as follows:

    - - 1. Evaluate |DecimalEscape| to obtain an integer _n_. - 1. Assert: _n_ ≤ _NcapturingParens_. - 1. Return ! BackreferenceMatcher(_n_, _direction_). - -

    The production AtomEscape :: CharacterEscape evaluates as follows:

    - - 1. Evaluate |CharacterEscape| to obtain a character _ch_. - 1. Let _A_ be a one-element CharSet containing the character _ch_. - 1. Return ! CharacterSetMatcher(_A_, *false*, _direction_). - -

    The production AtomEscape :: CharacterClassEscape evaluates as follows:

    - - 1. Evaluate |CharacterClassEscape| to obtain a CharSet _A_. - 1. Return ! CharacterSetMatcher(_A_, *false*, _direction_). - - -

    An escape sequence of the form `\\` followed by a non-zero decimal number _n_ matches the result of the _n_th set of capturing parentheses (). It is an error if the regular expression has fewer than _n_ capturing parentheses. If the regular expression has _n_ or more capturing parentheses but the _n_th one is *undefined* because it has not captured anything, then the backreference always succeeds.

    -
    -

    The production AtomEscape :: `k` GroupName evaluates as follows:

    - - 1. Search the enclosing |Pattern| for an instance of a |GroupSpecifier| containing a |RegExpIdentifierName| which has a CapturingGroupName equal to the CapturingGroupName of the |RegExpIdentifierName| contained in |GroupName|. - 1. Assert: A unique such |GroupSpecifier| is found. - 1. Let _parenIndex_ be the number of left-capturing parentheses in the entire regular expression that occur to the left of the located |GroupSpecifier|. This is the total number of Atom :: `(` GroupSpecifier Disjunction `)` Parse Nodes prior to or enclosing the located |GroupSpecifier|, including its immediately enclosing |Atom|. - 1. Return ! BackreferenceMatcher(_parenIndex_, _direction_). - - - -

    - BackreferenceMatcher ( - _n_: a positive integer, - _direction_: 1 or -1, - ) -

    -
    -
    - - 1. Assert: _n_ ≥ 1. - 1. Return a new Matcher with parameters (_x_, _c_) that captures _n_ and _direction_ and performs the following steps when called: - 1. Assert: _x_ is a State. - 1. Assert: _c_ is a Continuation. - 1. Let _cap_ be _x_'s _captures_ List. - 1. Let _s_ be _cap_[_n_]. - 1. If _s_ is *undefined*, return _c_(_x_). - 1. Let _e_ be _x_'s _endIndex_. - 1. Let _len_ be the number of elements in _s_. - 1. Let _f_ be _e_ + _direction_ × _len_. - 1. If _f_ < 0 or _f_ > _InputLength_, return ~failure~. - 1. Let _g_ be min(_e_, _f_). - 1. If there exists an integer _i_ between 0 (inclusive) and _len_ (exclusive) such that Canonicalize(_s_[_i_]) is not the same character value as Canonicalize(_Input_[_g_ + _i_]), return ~failure~. - 1. Let _y_ be the State (_f_, _cap_). - 1. Return _c_(_y_). - -
    -
    - - -

    CharacterEscape

    -

    The |CharacterEscape| productions evaluate as follows:

    - - CharacterEscape :: - ControlEscape - `c` ControlLetter - `0` [lookahead ∉ DecimalDigit] - HexEscapeSequence - RegExpUnicodeEscapeSequence - IdentityEscape - - - 1. Let _cv_ be the CharacterValue of this |CharacterEscape|. - 1. Return the character whose character value is _cv_. - -
    - - -

    DecimalEscape

    -

    The |DecimalEscape| productions evaluate as follows:

    - DecimalEscape :: NonZeroDigit DecimalDigits? - - 1. Return the CapturingGroupNumber of this |DecimalEscape|. - - -

    If `\\` is followed by a decimal number _n_ whose first digit is not `0`, then the escape sequence is considered to be a backreference. It is an error if _n_ is greater than the total number of left-capturing parentheses in the entire regular expression.

    -
    -
    - - -

    CharacterClassEscape

    -

    The production CharacterClassEscape :: `d` evaluates as follows:

    - - 1. Return the ten-element CharSet containing the characters `0` through `9` inclusive. - -

    The production CharacterClassEscape :: `D` evaluates as follows:

    - - 1. Return the CharSet containing all characters not in the CharSet returned by CharacterClassEscape :: `d` . - -

    The production CharacterClassEscape :: `s` evaluates as follows:

    - - 1. Return the CharSet containing all characters corresponding to a code point on the right-hand side of the |WhiteSpace| or |LineTerminator| productions. - -

    The production CharacterClassEscape :: `S` evaluates as follows:

    - - 1. Return the CharSet containing all characters not in the CharSet returned by CharacterClassEscape :: `s` . - -

    The production CharacterClassEscape :: `w` evaluates as follows:

    - - 1. Return _WordCharacters_. - -

    The production CharacterClassEscape :: `W` evaluates as follows:

    - - 1. Return the CharSet containing all characters not in the CharSet returned by CharacterClassEscape :: `w` . - -

    The production CharacterClassEscape :: `p{` UnicodePropertyValueExpression `}` evaluates as follows:

    - - 1. Return the CharSet containing all Unicode code points included in the CharSet returned by |UnicodePropertyValueExpression|. - -

    The production CharacterClassEscape :: `P{` UnicodePropertyValueExpression `}` evaluates as follows:

    - - 1. Return the CharSet containing all Unicode code points not included in the CharSet returned by |UnicodePropertyValueExpression|. - -

    The production UnicodePropertyValueExpression :: UnicodePropertyName `=` UnicodePropertyValue evaluates as follows:

    - - 1. Let _ps_ be SourceText of |UnicodePropertyName|. - 1. Let _p_ be ! UnicodeMatchProperty(_ps_). - 1. Assert: _p_ is a Unicode property name or property alias listed in the “Property name and aliases” column of . - 1. Let _vs_ be SourceText of |UnicodePropertyValue|. - 1. Let _v_ be ! UnicodeMatchPropertyValue(_p_, _vs_). - 1. Return the CharSet containing all Unicode code points whose character database definition includes the property _p_ with value _v_. - -

    The production UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue evaluates as follows:

    - - 1. Let _s_ be SourceText of |LoneUnicodePropertyNameOrValue|. - 1. If ! UnicodeMatchPropertyValue(`General_Category`, _s_) is identical to a List of Unicode code points that is the name of a Unicode general category or general category alias listed in the “Property value and aliases” column of , then - 1. Return the CharSet containing all Unicode code points whose character database definition includes the property “General_Category” with value _s_. - 1. Let _p_ be ! UnicodeMatchProperty(_s_). - 1. Assert: _p_ is a binary Unicode property or binary property alias listed in the “Property name and aliases” column of . - 1. Return the CharSet containing all Unicode code points whose character database definition includes the property _p_ with value “True”. - -
    -

    CharacterClass

    The production CharacterClass :: `[` ClassRanges `]` evaluates as follows:

    @@ -35756,6 +35621,150 @@

    ClassEscape

    A |ClassAtom| can use any of the escape sequences that are allowed in the rest of the regular expression except for `\\b`, `\\B`, and backreferences. Inside a |CharacterClass|, `\\b` means the backspace character, while `\\B` and backreferences raise errors. Using a backreference inside a |ClassAtom| causes an error.

    + + +

    AtomEscape

    +

    With parameter _direction_.

    +

    The production AtomEscape :: DecimalEscape evaluates as follows:

    + + 1. Evaluate |DecimalEscape| to obtain an integer _n_. + 1. Assert: _n_ ≤ _NcapturingParens_. + 1. Return ! BackreferenceMatcher(_n_, _direction_). + +

    The production AtomEscape :: CharacterClassEscape evaluates as follows:

    + + 1. Evaluate |CharacterClassEscape| to obtain a CharSet _A_. + 1. Return ! CharacterSetMatcher(_A_, *false*, _direction_). + + +

    An escape sequence of the form `\\` followed by a non-zero decimal number _n_ matches the result of the _n_th set of capturing parentheses (). It is an error if the regular expression has fewer than _n_ capturing parentheses. If the regular expression has _n_ or more capturing parentheses but the _n_th one is *undefined* because it has not captured anything, then the backreference always succeeds.

    +
    +

    The production AtomEscape :: CharacterEscape evaluates as follows:

    + + 1. Evaluate |CharacterEscape| to obtain a character _ch_. + 1. Let _A_ be a one-element CharSet containing the character _ch_. + 1. Return ! CharacterSetMatcher(_A_, *false*, _direction_). + +

    The production AtomEscape :: `k` GroupName evaluates as follows:

    + + 1. Search the enclosing |Pattern| for an instance of a |GroupSpecifier| containing a |RegExpIdentifierName| which has a CapturingGroupName equal to the CapturingGroupName of the |RegExpIdentifierName| contained in |GroupName|. + 1. Assert: A unique such |GroupSpecifier| is found. + 1. Let _parenIndex_ be the number of left-capturing parentheses in the entire regular expression that occur to the left of the located |GroupSpecifier|. This is the total number of Atom :: `(` GroupSpecifier Disjunction `)` Parse Nodes prior to or enclosing the located |GroupSpecifier|, including its immediately enclosing |Atom|. + 1. Return ! BackreferenceMatcher(_parenIndex_, _direction_). + + + +

    + BackreferenceMatcher ( + _n_: a positive integer, + _direction_: 1 or -1, + ) +

    +
    +
    + + 1. Assert: _n_ ≥ 1. + 1. Return a new Matcher with parameters (_x_, _c_) that captures _n_ and _direction_ and performs the following steps when called: + 1. Assert: _x_ is a State. + 1. Assert: _c_ is a Continuation. + 1. Let _cap_ be _x_'s _captures_ List. + 1. Let _s_ be _cap_[_n_]. + 1. If _s_ is *undefined*, return _c_(_x_). + 1. Let _e_ be _x_'s _endIndex_. + 1. Let _len_ be the number of elements in _s_. + 1. Let _f_ be _e_ + _direction_ × _len_. + 1. If _f_ < 0 or _f_ > _InputLength_, return ~failure~. + 1. Let _g_ be min(_e_, _f_). + 1. If there exists an integer _i_ between 0 (inclusive) and _len_ (exclusive) such that Canonicalize(_s_[_i_]) is not the same character value as Canonicalize(_Input_[_g_ + _i_]), return ~failure~. + 1. Let _y_ be the State (_f_, _cap_). + 1. Return _c_(_y_). + +
    +
    + + +
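The BackreferenceMatcher steps above can be sketched in Python as follows. This is only a reading aid under stated assumptions: a State is modelled as an `(endIndex, captures)` tuple, `None` stands in for both *undefined* and ~failure~, the input is passed explicitly instead of being read from _Input_, and `canonicalize` is a simplified stand-in for the spec's Canonicalize. All names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A State is (endIndex, captures). captures[0] is unused so that captures[n]
# lines up with the spec's 1-based group numbers; None means *undefined*.
State = Tuple[int, List[Optional[str]]]
Continuation = Callable[[State], Optional[State]]
FAILURE = None  # stands in for the spec's ~failure~


def canonicalize(ch: str, ignore_case: bool) -> str:
    """Simplified stand-in for Canonicalize (assumption: plain upper-casing)."""
    if not ignore_case:
        return ch
    upper = ch.upper()
    return upper if len(upper) == 1 else ch


def backreference_matcher(n: int, direction: int, input_chars: str,
                          ignore_case: bool = False):
    assert n >= 1
    assert direction in (1, -1)

    def matcher(x: State, c: Continuation) -> Optional[State]:
        e, cap = x
        s = cap[n]
        if s is None:
            # A group that captured nothing always succeeds.
            return c(x)
        length = len(s)
        f = e + direction * length
        if f < 0 or f > len(input_chars):
            return FAILURE
        g = min(e, f)  # leftmost index of the span being compared
        for i in range(length):
            if canonicalize(s[i], ignore_case) != canonicalize(input_chars[g + i], ignore_case):
                return FAILURE
        return c((f, cap))

    return matcher
```

With _direction_ equal to -1 the comparison still runs left to right over the span starting at min(_e_, _f_); only the resulting _endIndex_ moves backwards.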

    DecimalEscape

    +

    The |DecimalEscape| productions evaluate as follows:

    + DecimalEscape :: NonZeroDigit DecimalDigits? + + 1. Return the CapturingGroupNumber of this |DecimalEscape|. + + +

    If `\\` is followed by a decimal number _n_ whose first digit is not `0`, then the escape sequence is considered to be a backreference. It is an error if _n_ is greater than the total number of left-capturing parentheses in the entire regular expression.

    +
    +
    + + +

    CharacterClassEscape

    +

    The production CharacterClassEscape :: `d` evaluates as follows:

    + + 1. Return the ten-element CharSet containing the characters `0` through `9` inclusive. + +

    The production CharacterClassEscape :: `D` evaluates as follows:

    + + 1. Return the CharSet containing all characters not in the CharSet returned by CharacterClassEscape :: `d` . + +

    The production CharacterClassEscape :: `s` evaluates as follows:

    + + 1. Return the CharSet containing all characters corresponding to a code point on the right-hand side of the |WhiteSpace| or |LineTerminator| productions. + +

    The production CharacterClassEscape :: `S` evaluates as follows:

    + + 1. Return the CharSet containing all characters not in the CharSet returned by CharacterClassEscape :: `s` . + +

    The production CharacterClassEscape :: `w` evaluates as follows:

    + + 1. Return _WordCharacters_. + +

    The production CharacterClassEscape :: `W` evaluates as follows:

    + + 1. Return the CharSet containing all characters not in the CharSet returned by CharacterClassEscape :: `w` . + +

    The production CharacterClassEscape :: `p{` UnicodePropertyValueExpression `}` evaluates as follows:

    + + 1. Return the CharSet containing all Unicode code points included in the CharSet returned by |UnicodePropertyValueExpression|. + +

    The production CharacterClassEscape :: `P{` UnicodePropertyValueExpression `}` evaluates as follows:

    + + 1. Return the CharSet containing all Unicode code points not included in the CharSet returned by |UnicodePropertyValueExpression|. + +

    The production UnicodePropertyValueExpression :: UnicodePropertyName `=` UnicodePropertyValue evaluates as follows:

    + + 1. Let _ps_ be SourceText of |UnicodePropertyName|. + 1. Let _p_ be ! UnicodeMatchProperty(_ps_). + 1. Assert: _p_ is a Unicode property name or property alias listed in the “Property name and aliases” column of . + 1. Let _vs_ be SourceText of |UnicodePropertyValue|. + 1. Let _v_ be ! UnicodeMatchPropertyValue(_p_, _vs_). + 1. Return the CharSet containing all Unicode code points whose character database definition includes the property _p_ with value _v_. + +

    The production UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue evaluates as follows:

    + + 1. Let _s_ be SourceText of |LoneUnicodePropertyNameOrValue|. + 1. If ! UnicodeMatchPropertyValue(`General_Category`, _s_) is identical to a List of Unicode code points that is the name of a Unicode general category or general category alias listed in the “Property value and aliases” column of , then + 1. Return the CharSet containing all Unicode code points whose character database definition includes the property “General_Category” with value _s_. + 1. Let _p_ be ! UnicodeMatchProperty(_s_). + 1. Assert: _p_ is a binary Unicode property or binary property alias listed in the “Property name and aliases” column of . + 1. Return the CharSet containing all Unicode code points whose character database definition includes the property _p_ with value “True”. + +
    + + +

    CharacterEscape

    +

    The |CharacterEscape| productions evaluate as follows:

    + + CharacterEscape :: + ControlEscape + `c` ControlLetter + `0` [lookahead ∉ DecimalDigit] + HexEscapeSequence + RegExpUnicodeEscapeSequence + IdentityEscape + + + 1. Let _cv_ be the CharacterValue of this |CharacterEscape|. + 1. Return the character whose character value is _cv_. + +
    @@ -46514,26 +46523,23 @@

    Regular Expressions

    - - - - - + - -

    Each `\\u` |HexTrailSurrogate| for which the choice of associated `u` |HexLeadSurrogate| is ambiguous shall be associated with the nearest possible `u` |HexLeadSurrogate| that would otherwise have no corresponding `\\u` |HexTrailSurrogate|.

    -

     

    - - - - + + + + + + + + @@ -46544,13 +46550,14 @@

    Regular Expressions

    - - - - - - - + + + + + + + + From 9ce84f7a62aa664ef10764bb6c8d356d5f9d0b03 Mon Sep 17 00:00:00 2001 From: Michael Dyck Date: Mon, 19 Aug 2019 20:36:45 -0400 Subject: [PATCH 05/12] Normative: Make B.1.4 "Regular Expressions Patterns" normative (Merge its Syntax, Static Semantics, and Runtime Semantics into the main body.) (Part of Annex B reform, see PR #1595.) --- spec.html | 506 ++++++++++++++++++++---------------------------------- 1 file changed, 190 insertions(+), 316 deletions(-) diff --git a/spec.html b/spec.html index 57734d77ca..56ec0a727c 100644 --- a/spec.html +++ b/spec.html @@ -28336,7 +28336,7 @@

    Forbidden Extensions

    The behaviour of built-in methods which are specified in ECMA-402, such as those named `toLocaleString`, must not be extended except as specified in ECMA-402.
  • - The RegExp pattern grammars in and must not be extended to recognize any of the source characters A-Z or a-z as |IdentityEscape[+UnicodeMode]| when the [UnicodeMode] grammar parameter is present. + The RegExp pattern grammars in must not be extended to recognize any of the source characters A-Z or a-z as |IdentityEscape| when the [UnicodeMode] grammar parameter is present.
  • The Syntactic Grammar must not be extended in any manner that allows the token `:` to immediately follow source text that matches the |BindingIdentifier| nonterminal symbol. @@ -34243,6 +34243,7 @@

    RegExp (Regular Expression) Objects

    Syntax for Patterns

    The `RegExp` constructor applies the following grammar to the input pattern String. An error occurs if the grammar cannot interpret the String as an expansion of |Pattern|.

    +

    Some of these productions (indicated by “::!”) introduce ambiguities that are broken by the ordering of alternatives. When parsing using such productions, each alternative is considered only if previous alternatives do not match.
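One way to picture the “::!” rule is as an ordered choice: alternatives are tried in the order written and the first one that matches wins. The Python sketch below applies that idea to the last two alternatives of |ExtendedAtom| as they appear later in this patch (|InvalidBracedQuantifier|, then |ExtendedPatternCharacter|). The helper names and the `(name, end_index)` return shape are hypothetical, and the sketch does not model the rest of the grammar.

```python
import re
from typing import Callable, List, Optional, Tuple

# Each alternative is (name, matcher); a matcher returns the end index or None.
Alternative = Tuple[str, Callable[[str, int], Optional[int]]]

_BRACED = re.compile(r"\{[0-9]+(,[0-9]*)?\}")
_EXCLUDED = set("^$\\.*+?()[|")  # characters ExtendedPatternCharacter rejects


def match_invalid_braced_quantifier(text: str, pos: int) -> Optional[int]:
    m = _BRACED.match(text, pos)
    return m.end() if m else None


def match_extended_pattern_character(text: str, pos: int) -> Optional[int]:
    if pos < len(text) and text[pos] not in _EXCLUDED:
        return pos + 1
    return None


# Order matters: the earlier alternative shadows the later one.
EXTENDED_ATOM_TAIL: List[Alternative] = [
    ("InvalidBracedQuantifier", match_invalid_braced_quantifier),
    ("ExtendedPatternCharacter", match_extended_pattern_character),
]


def ordered_choice(alternatives: List[Alternative], text: str,
                   pos: int) -> Optional[Tuple[str, int]]:
    """Return the first alternative that matches at pos, or None."""
    for name, matcher in alternatives:
        end = matcher(text, pos)
        if end is not None:
            return name, end
    return None
```

Because |InvalidBracedQuantifier| is listed first, text such as `{1,2}` is claimed by it (and then rejected by the early error for that production) instead of being read as a literal `{` via |ExtendedPatternCharacter|.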

    Patterns

    Pattern[UnicodeMode, N] :: @@ -34256,21 +34257,30 @@

    Patterns

    [empty] Alternative[?UnicodeMode, ?N] Term[?UnicodeMode, ?N] - Term[UnicodeMode, N] :: - Assertion[?UnicodeMode, ?N] - Atom[?UnicodeMode, ?N] - Atom[?UnicodeMode, ?N] Quantifier + Term[UnicodeMode, N] ::! + [+UnicodeMode] Assertion[+UnicodeMode, ?N] + [+UnicodeMode] Atom[+UnicodeMode, ?N] Quantifier + [+UnicodeMode] Atom[+UnicodeMode, ?N] + [~UnicodeMode] QuantifiableAssertion[?N] Quantifier + [~UnicodeMode] Assertion[~UnicodeMode, ?N] + [~UnicodeMode] ExtendedAtom[?N] Quantifier + [~UnicodeMode] ExtendedAtom[?N] Assertion[UnicodeMode, N] :: `^` `$` `\` `b` `\` `B` - `(` `?` `=` Disjunction[?UnicodeMode, ?N] `)` - `(` `?` `!` Disjunction[?UnicodeMode, ?N] `)` + [+UnicodeMode] `(` `?` `=` Disjunction[+UnicodeMode, ?N] `)` + [+UnicodeMode] `(` `?` `!` Disjunction[+UnicodeMode, ?N] `)` + [~UnicodeMode] QuantifiableAssertion[?N] `(` `?` `<=` Disjunction[?UnicodeMode, ?N] `)` `(` `?` `<!` Disjunction[?UnicodeMode, ?N] `)` + QuantifiableAssertion[N] :: + `(` `?` `=` Disjunction[~UnicodeMode, ?N] `)` + `(` `?` `!` Disjunction[~UnicodeMode, ?N] `)` + Quantifier :: QuantifierPrefix QuantifierPrefix `?` @@ -34283,6 +34293,16 @@

    Patterns

    `{` DecimalDigits[~Sep] `,` `}` `{` DecimalDigits[~Sep] `,` DecimalDigits[~Sep] `}` + ExtendedAtom[N] ::! + `.` + `\` AtomEscape[~UnicodeMode, ?N] + `\` [lookahead == `c`] + CharacterClass[~UnicodeMode] + `(` Disjunction[~UnicodeMode, ?N] `)` + `(` `?` `:` Disjunction[~UnicodeMode, ?N] `)` + InvalidBracedQuantifier + ExtendedPatternCharacter + Atom[UnicodeMode, N] :: PatternCharacter `.` @@ -34291,6 +34311,14 @@

    Patterns

    `(` GroupSpecifier[?UnicodeMode] Disjunction[?UnicodeMode, ?N] `)` `(` `?` `:` Disjunction[?UnicodeMode, ?N] `)` + InvalidBracedQuantifier :: + `{` DecimalDigits[~Sep] `}` + `{` DecimalDigits[~Sep] `,` `}` + `{` DecimalDigits[~Sep] `,` DecimalDigits[~Sep] `}` + + ExtendedPatternCharacter :: + SourceCharacter but not one of `^` `$` `\` `.` `*` `+` `?` `(` `)` `[` `|` + PatternCharacter :: SourceCharacter but not SyntaxCharacter @@ -34352,23 +34380,30 @@

    Character Classes

    `-` ClassAtomNoDash[?UnicodeMode] - ClassAtomNoDash[UnicodeMode] :: + ClassAtomNoDash[UnicodeMode, N] ::! SourceCharacter but not one of `\` or `]` or `-` - `\` ClassEscape[?UnicodeMode] + `\` ClassEscape[?UnicodeMode, ?N] + `\` [lookahead == `c`]

    Escapes

    - ClassEscape[UnicodeMode] :: + ClassEscape[UnicodeMode, N] ::! `b` [+UnicodeMode] `-` + [~UnicodeMode] `c` ClassControlLetter CharacterClassEscape[?UnicodeMode] - CharacterEscape[?UnicodeMode] + CharacterEscape[?UnicodeMode, ?N] + + ClassControlLetter :: + DecimalDigit + `_` - AtomEscape[UnicodeMode, N] :: - DecimalEscape + AtomEscape[UnicodeMode, N] ::! + [+UnicodeMode] DecimalEscape + [~UnicodeMode] DecimalEscape [> but only if the CapturingGroupNumber of |DecimalEscape| is ≤ _NcapturingParens_] CharacterClassEscape[?UnicodeMode] - CharacterEscape[?UnicodeMode] + CharacterEscape[?UnicodeMode, ?N] [+N] `k` GroupName[?UnicodeMode] DecimalEscape :: @@ -34411,13 +34446,14 @@

    Escapes

    ControlLetter `_` - CharacterEscape[UnicodeMode] :: + CharacterEscape[UnicodeMode, N] ::! ControlEscape `c` ControlLetter `0` [lookahead ∉ DecimalDigit] HexEscapeSequence RegExpUnicodeEscapeSequence[?UnicodeMode] - IdentityEscape[?UnicodeMode] + [~UnicodeMode] LegacyOctalEscapeSequence + IdentityEscape[?UnicodeMode, ?N] ControlEscape :: one of `f` `n` `r` `t` `v` @@ -34445,25 +34481,36 @@

    Escapes

    HexNonSurrogate :: Hex4Digits [> but only if the MV of |Hex4Digits| is not in the inclusive range 0xD800 to 0xDFFF] - IdentityEscape[UnicodeMode] :: + IdentityEscape[UnicodeMode, N] :: [+UnicodeMode] SyntaxCharacter [+UnicodeMode] `/` - [~UnicodeMode] SourceCharacter but not UnicodeIDContinue + [~UnicodeMode] SourceCharacterIdentityEscape[?N] + + SourceCharacterIdentityEscape[N] :: + [~N] SourceCharacter but not `c` + [+N] SourceCharacter but not one of `c` or `k`
    + +

    Patterns that use the following productions are allowed, but deprecated:

    + + Term ::! QuantifiableAssertion Quantifier + + ExtendedAtom ::! `\` [lookahead == `c`] + + ClassAtomNoDash ::! `\` [lookahead == `c`] + + ClassEscape ::! `c` ClassControlLetter + + CharacterEscape ::! LegacyOctalEscapeSequence + +

    Static Semantics for Patterns

    - -

    A number of productions in this section are given alternative definitions in section .

    -
    - - +

    Static Semantics: Early Errors

    - -

    This section is amended in .

    -
    Pattern :: Disjunction
    • @@ -34479,6 +34526,12 @@

      Static Semantics: Early Errors

      It is a Syntax Error if the MV of the first |DecimalDigits| is larger than the MV of the second |DecimalDigits|.
    + ExtendedAtom ::! InvalidBracedQuantifier +
      +
    • + It is a Syntax Error if any source text matches this rule. +
    • +
    RegExpIdentifierStart :: `\` RegExpUnicodeEscapeSequence
    • @@ -34506,7 +34559,7 @@

      Static Semantics: Early Errors

      NonemptyClassRanges :: ClassAtom `-` ClassAtom ClassRanges
      • - It is a Syntax Error if IsCharacterClass of the first |ClassAtom| is *true* or IsCharacterClass of the second |ClassAtom| is *true*. + It is a Syntax Error if IsCharacterClass of the first |ClassAtom| is *true* or IsCharacterClass of the second |ClassAtom| is *true* and this production has a [UnicodeMode] parameter.
      • It is a Syntax Error if IsCharacterClass of the first |ClassAtom| is *false* and IsCharacterClass of the second |ClassAtom| is *false* and the CharacterValue of the first |ClassAtom| is larger than the CharacterValue of the second |ClassAtom|. @@ -34515,19 +34568,19 @@

        Static Semantics: Early Errors

        NonemptyClassRangesNoDash :: ClassAtomNoDash `-` ClassAtom ClassRanges
        • - It is a Syntax Error if IsCharacterClass of |ClassAtomNoDash| is *true* or IsCharacterClass of |ClassAtom| is *true*. + It is a Syntax Error if IsCharacterClass of |ClassAtomNoDash| is *true* or IsCharacterClass of |ClassAtom| is *true* and this production has a [UnicodeMode] parameter.
        • It is a Syntax Error if IsCharacterClass of |ClassAtomNoDash| is *false* and IsCharacterClass of |ClassAtom| is *false* and the CharacterValue of |ClassAtomNoDash| is larger than the CharacterValue of |ClassAtom|.
        - AtomEscape :: DecimalEscape + AtomEscape ::! DecimalEscape
        • It is a Syntax Error if the CapturingGroupNumber of |DecimalEscape| is larger than _NcapturingParens_ ().
        - AtomEscape :: `k` GroupName + AtomEscape ::! `k` GroupName
        • It is a Syntax Error if the enclosing |Pattern| does not contain a |GroupSpecifier| with an enclosed |RegExpIdentifierName| whose CapturingGroupName equals the CapturingGroupName of the |RegExpIdentifierName| of this production's |GroupName|. @@ -34554,9 +34607,6 @@

          Static Semantics: Early Errors

          Static Semantics: CapturingGroupNumber

          - -

          This section is amended in .

          -
          DecimalEscape :: NonZeroDigit 1. Return the MV of |NonZeroDigit|. @@ -34569,40 +34619,36 @@

          Static Semantics: CapturingGroupNumber

          The definitions of “the MV of |NonZeroDigit|” and “the MV of |DecimalDigits|” are in .

          - +

          Static Semantics: IsCharacterClass

          - -

          This section is amended in .

          -
          ClassAtom :: `-` - ClassAtomNoDash :: SourceCharacter but not one of `\` or `]` or `-` + ClassAtomNoDash ::! SourceCharacter but not one of `\` or `]` or `-` - ClassEscape :: `b` + ClassAtomNoDash ::! `\` [lookahead == `c`] - ClassEscape :: `-` + ClassEscape ::! `b` - ClassEscape :: CharacterEscape + ClassEscape ::! `-` + + ClassEscape ::! CharacterEscape 1. Return *false*. - ClassEscape :: CharacterClassEscape + ClassEscape ::! CharacterClassEscape 1. Return *true*.
          - +

          Static Semantics: CharacterValue

          - -

          This section is amended in .

          -
          ClassAtom :: `-` @@ -34610,25 +34656,37 @@

          Static Semantics: CharacterValue

          1. Return the code point value of U+002D (HYPHEN-MINUS).
          - ClassAtomNoDash :: SourceCharacter but not one of `\` or `]` or `-` + ClassAtomNoDash ::! SourceCharacter but not one of `\` or `]` or `-` 1. Let _ch_ be the code point matched by |SourceCharacter|. 1. Return the code point value of _ch_. - ClassEscape :: `b` + ClassAtomNoDash ::! `\` [lookahead == `c`] + + + 1. Return the code point value of U+005C (REVERSE SOLIDUS). + + + ClassEscape ::! `b` 1. Return the code point value of U+0008 (BACKSPACE). - ClassEscape :: `-` + ClassEscape ::! `-` 1. Return the code point value of U+002D (HYPHEN-MINUS). - CharacterEscape :: ControlEscape + ClassEscape ::! `c` ClassControlLetter + + 1. Let _ch_ be the code point matched by |ClassControlLetter|. + 1. Let _i_ be _ch_'s code point value. + 1. Return the remainder of dividing _i_ by 32. + + CharacterEscape ::! ControlEscape 1. Return the code point value according to . @@ -34740,23 +34798,27 @@

          Static Semantics: CharacterValue

          - CharacterEscape :: `c` ControlLetter + CharacterEscape ::! `c` ControlLetter 1. Let _ch_ be the code point matched by |ControlLetter|. 1. Let _i_ be _ch_'s code point value. 1. Return the remainder of dividing _i_ by 32. - CharacterEscape :: `0` [lookahead ∉ DecimalDigit] + CharacterEscape ::! `0` [lookahead ∉ DecimalDigit] 1. Return the code point value of U+0000 (NULL).

          `\\0` represents the <NUL> character and cannot be followed by a decimal digit.
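The CharacterValue rules for `c` |ControlLetter| and `c` |ClassControlLetter| both reduce a code point value to its remainder modulo 32. A minimal Python sketch of that arithmetic, with the hypothetical name `control_character_value`:

```python
def control_character_value(ch: str) -> int:
    """Remainder of dividing the code point value of ch by 32."""
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    return ord(ch) % 32
```

For instance, `J` (U+004A) and `j` (U+006A) both give 10, the code point value of LINE FEED, while the |ClassControlLetter| `_` (U+005F) gives 31.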

          - CharacterEscape :: HexEscapeSequence + CharacterEscape ::! HexEscapeSequence 1. Return the MV of |HexEscapeSequence|. + CharacterEscape ::! LegacyOctalEscapeSequence + + 1. Return the MV of |LegacyOctalEscapeSequence| (see ). + RegExpUnicodeEscapeSequence :: `u` HexLeadSurrogate `\u` HexTrailSurrogate 1. Let _lead_ be the CharacterValue of |HexLeadSurrogate|. @@ -34868,7 +34930,7 @@

          Static Semantics: RegExpIdentifierCodePoint

          - +

          Runtime Semantics for Patterns

          A regular expression pattern is converted into an Abstract Closure using the process described below. An implementation is encouraged to use more efficient algorithms than the ones listed below, as long as the results are the same. The Abstract Closure is used as the value of a RegExp object's [[RegExpMatcher]] internal slot.

          A |Pattern| is either a BMP pattern or a Unicode pattern depending upon whether or not its associated flags contain a `u`. A BMP pattern matches against a String interpreted as consisting of a sequence of 16-bit values that are Unicode code points in the range of the Basic Multilingual Plane. A Unicode pattern matches against a String interpreted as consisting of Unicode code points encoded using UTF-16. In the context of describing the behaviour of a BMP pattern “character” means a single 16-bit Unicode BMP code point. In the context of describing the behaviour of a Unicode pattern “character” means a UTF-16 encoded code point (). In either context, “character value” means the numeric value of the corresponding non-encoded code point.

          @@ -35023,18 +35085,18 @@

          Alternative

          Term

          With parameter _direction_.

          -

          The production Term :: Assertion evaluates as follows:

          +

          The production Term ::! Assertion evaluates as follows:

          1. Return the Matcher that is the result of evaluating |Assertion|.

          The resulting Matcher is independent of _direction_.

          -

          The production Term :: Atom evaluates as follows:

          +

          The production Term ::! Atom evaluates as follows:

          1. Return the Matcher that is the result of evaluating |Atom| with argument _direction_. -

          The production Term :: Atom Quantifier evaluates as follows:

          +

          The production Term ::! Atom Quantifier evaluates as follows:

          1. Evaluate |Atom| with argument _direction_ to obtain a Matcher _m_. 1. Evaluate |Quantifier| to obtain the three results: a non-negative integer _min_, a non-negative integer (or +∞) _max_, and Boolean _greedy_. @@ -35046,6 +35108,11 @@

          Term

          1. Assert: _c_ is a Continuation. 1. Return ! RepeatMatcher(_m_, _min_, _max_, _greedy_, _x_, _c_, _parenIndex_, _parenCount_).
          +

          ----

          +

          In the above algorithm, references to Atom :: `(` GroupSpecifier Disjunction `)` are to be interpreted as meaning Atom :: `(` GroupSpecifier Disjunction `)` or ExtendedAtom ::! `(` Disjunction `)` .

          +

          The production Term ::! QuantifiableAssertion Quantifier evaluates the same as the production Term ::! Atom Quantifier but with |QuantifiableAssertion| substituted for |Atom|.

          +

          The production Term ::! ExtendedAtom Quantifier evaluates the same as the production Term ::! Atom Quantifier but with |ExtendedAtom| substituted for |Atom|.

          +

          The production Term ::! ExtendedAtom evaluates the same as the production Term ::! Atom but with |ExtendedAtom| substituted for |Atom|.

          @@ -35203,6 +35270,11 @@

          Assertion

          1. If _r_ is not ~failure~, return ~failure~. 1. Return _c_(_x_).
          +

          The production Assertion :: QuantifiableAssertion evaluates as follows:

          + + 1. Evaluate |QuantifiableAssertion| to obtain a Matcher _m_. + 1. Return _m_. +

          The production Assertion :: `(` `?` `<=` Disjunction `)` evaluates as follows:

          1. Evaluate |Disjunction| with -1 as its _direction_ argument to obtain a Matcher _m_. @@ -35233,6 +35305,8 @@

          Assertion

          1. If _r_ is not ~failure~, return ~failure~. 1. Return _c_(_x_).
          +

          ----

          +

          The evaluation rules for the Assertion :: `(` `?` `=` Disjunction `)` and Assertion :: `(` `?` `!` Disjunction `)` productions are also used for the |QuantifiableAssertion| productions, but with |QuantifiableAssertion| substituted for |Assertion|.

          @@ -35312,6 +35386,11 @@

          Atom

          1. Return the Matcher that is the result of evaluating |AtomEscape| with argument _direction_. +

          The production ExtendedAtom ::! `\` [lookahead == `c`] evaluates as follows:

          + + 1. Let _A_ be the CharSet containing the single character `\\` U+005C (REVERSE SOLIDUS). + 1. Return ! CharacterSetMatcher(_A_, *false*, _direction_). +

          The production Atom :: CharacterClass evaluates as follows:

          1. Evaluate |CharacterClass| to obtain a CharSet _A_ and a Boolean _invert_. @@ -35345,6 +35424,14 @@

          Atom

          1. Return the Matcher that is the result of evaluating |Disjunction| with argument _direction_. +

          The production ExtendedAtom ::! ExtendedPatternCharacter evaluates as follows:

          + + 1. Let _ch_ be the character represented by |ExtendedPatternCharacter|. + 1. Let _A_ be a one-element CharSet containing the character _ch_. + 1. Return ! CharacterSetMatcher(_A_, *false*, _direction_). + +

          ----

          +

          The evaluation rules for the |Atom| productions except for Atom :: PatternCharacter are also used for the |ExtendedAtom| productions, but with |ExtendedAtom| substituted for |Atom|.

          @@ -35516,10 +35603,28 @@

          NonemptyClassRanges

          1. Evaluate the first |ClassAtom| to obtain a CharSet _A_. 1. Evaluate the second |ClassAtom| to obtain a CharSet _B_. 1. Evaluate |ClassRanges| to obtain a CharSet _C_. - 1. Let _D_ be ! CharacterRange(_A_, _B_). + 1. Let _D_ be ! CharacterRangeOrUnion(_A_, _B_). 1. Return the union of _D_ and _C_.
          + +

          + CharacterRangeOrUnion ( + _A_: a CharSet, + _B_: a CharSet, + ) +

          +
          +
          +
          + 1. If _Unicode_ is *false*, then
          +   1. If _A_ does not contain exactly one character or _B_ does not contain exactly one character, then
          +     1. Let _C_ be the CharSet containing the single character `-` U+002D (HYPHEN-MINUS).
          +     1. Return the union of CharSets _A_, _B_ and _C_.
          + 1. Return ! CharacterRange(_A_, _B_).
          +
          +
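
          The effect of CharacterRangeOrUnion can be sketched from script. This is an illustrative example only, assuming a host that implements the non-Unicode behaviour described above; the variable names are made up for the example.

```javascript
// In [\w-a] the first ClassAtom evaluates to the CharSet for \w, which holds
// more than one character. Without the u flag, CharacterRangeOrUnion therefore
// returns the union of \w, "a", and "-" instead of a range.
const legacyClass = /^[\w-a]$/;
const hyphenMatches = legacyClass.test("-");
const wordCharMatches = legacyClass.test("z");
const otherRejected = !legacyClass.test("!");

// When both endpoints are single characters, the ordinary CharacterRange applies
// and the hyphen is not itself a member of the class.
const plainRange = /^[a-c]$/.test("b") && !/^[a-c]$/.test("-");

// With the u flag, a character class escape as a range endpoint is an early error.
let unicodeThrows = false;
try {
  new RegExp("[\\w-a]", "u");
} catch (e) {
  unicodeThrows = e instanceof SyntaxError;
}

console.log({ hyphenMatches, wordCharMatches, otherRejected, plainRange, unicodeThrows });
```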

          CharacterRange ( @@ -35558,7 +35663,7 @@

          NonemptyClassRangesNoDash

          1. Evaluate |ClassAtomNoDash| to obtain a CharSet _A_. 1. Evaluate |ClassAtom| to obtain a CharSet _B_. 1. Evaluate |ClassRanges| to obtain a CharSet _C_. - 1. Let _D_ be ! CharacterRange(_A_, _B_). + 1. Let _D_ be ! CharacterRangeOrUnion(_A_, _B_). 1. Return the union of _D_ and _C_. @@ -35586,25 +35691,32 @@

          ClassAtom

          ClassAtomNoDash

          -

          The production ClassAtomNoDash :: SourceCharacter but not one of `\` or `]` or `-` evaluates as follows:

          +

          The production ClassAtomNoDash ::! SourceCharacter but not one of `\` or `]` or `-` evaluates as follows:

          1. Return the CharSet containing the character matched by |SourceCharacter|. -

          The production ClassAtomNoDash :: `\` ClassEscape evaluates as follows:

          +

          The production ClassAtomNoDash ::! `\` ClassEscape evaluates as follows:

          1. Return the CharSet that is the result of evaluating |ClassEscape|. +

          The production ClassAtomNoDash ::! `\` [lookahead == `c`] evaluates as follows:

          + + 1. Return the CharSet containing the single character `\\` U+005C (REVERSE SOLIDUS). + + This production can only be reached from the sequence `\c` within a character class where it is not followed by an acceptable control character.
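
          As an illustration of the `\c` fallback inside a character class, here is a small sketch. It assumes a host that implements this production for patterns without the u flag; the variable names are hypothetical.

```javascript
// `\c` is followed by `]`, which is not an acceptable control character, so the
// backslash contributes U+005C and the `c` is then parsed as an ordinary ClassAtom.
const loneControl = /^[\c]$/;
const matchesBackslash = loneControl.test("\\");
const matchesLetterC = loneControl.test("c");

// By contrast, `\cJ` is a well-formed control escape and denotes U+000A only.
const controlJ = /^[\cJ]$/;
const matchesLineFeed = controlJ.test("\n");
const controlJRejectsBackslash = !controlJ.test("\\");

console.log({ matchesBackslash, matchesLetterC, matchesLineFeed, controlJRejectsBackslash });
```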

          ClassEscape

          The |ClassEscape| productions evaluate as follows:

          - ClassEscape :: `b` + ClassEscape ::! `b` + + ClassEscape ::! `-` - ClassEscape :: `-` + ClassEscape ::! `c` ClassControlLetter - ClassEscape :: CharacterEscape + ClassEscape ::! CharacterEscape 1. Let _cv_ be the CharacterValue of this |ClassEscape|. @@ -35612,7 +35724,7 @@

          ClassEscape

          1. Return the CharSet containing the single character _c_.
          - ClassEscape :: CharacterClassEscape + ClassEscape ::! CharacterClassEscape 1. Return the CharSet that is the result of evaluating |CharacterClassEscape|. @@ -35625,13 +35737,13 @@

          ClassEscape

          AtomEscape

          With parameter _direction_.

          -

          The production AtomEscape :: DecimalEscape evaluates as follows:

          +

          The production AtomEscape ::! DecimalEscape evaluates as follows:

          1. Evaluate |DecimalEscape| to obtain an integer _n_. 1. Assert: _n_ ≤ _NcapturingParens_. 1. Return ! BackreferenceMatcher(_n_, _direction_). -

          The production AtomEscape :: CharacterClassEscape evaluates as follows:

          +

          The production AtomEscape ::! CharacterClassEscape evaluates as follows:

          1. Evaluate |CharacterClassEscape| to obtain a CharSet _A_. 1. Return ! CharacterSetMatcher(_A_, *false*, _direction_). @@ -35639,13 +35751,13 @@

          AtomEscape

          An escape sequence of the form `\\` followed by a non-zero decimal number _n_ matches the result of the _n_th set of capturing parentheses (). It is an error if the regular expression has fewer than _n_ capturing parentheses. If the regular expression has _n_ or more capturing parentheses but the _n_th one is *undefined* because it has not captured anything, then the backreference always succeeds.
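
          The backreference behaviour described in this note can be illustrated as follows. This is a sketch rather than normative text; it uses the u flag for the error case, since that is where a reference to a missing group is expected to be rejected.

```javascript
// \1 matches whatever the first capturing group matched.
const repeated = /^(a)\1$/.test("aa");

// Here group 1 does not participate in the match, so its capture is undefined
// and the backreference succeeds without consuming any input.
const undefinedCaptureSucceeds = /^(a)?\1b$/.test("b");

// Referring to a group that does not exist is an error in a Unicode pattern.
let missingGroupThrows = false;
try {
  new RegExp("\\1", "u");
} catch (e) {
  missingGroupThrows = e instanceof SyntaxError;
}

console.log({ repeated, undefinedCaptureSucceeds, missingGroupThrows });
```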

          -

          The production AtomEscape :: CharacterEscape evaluates as follows:

          +

          The production AtomEscape ::! CharacterEscape evaluates as follows:

          1. Evaluate |CharacterEscape| to obtain a character _ch_. 1. Let _A_ be a one-element CharSet containing the character _ch_. 1. Return ! CharacterSetMatcher(_A_, *false*, _direction_). -

          The production AtomEscape :: `k` GroupName evaluates as follows:

          +

          The production AtomEscape ::! `k` GroupName evaluates as follows:

          1. Search the enclosing |Pattern| for an instance of a |GroupSpecifier| containing a |RegExpIdentifierName| which has a CapturingGroupName equal to the CapturingGroupName of the |RegExpIdentifierName| contained in |GroupName|. 1. Assert: A unique such |GroupSpecifier| is found. @@ -35752,12 +35864,13 @@

          CharacterClassEscape

          CharacterEscape

          The |CharacterEscape| productions evaluate as follows:

          - CharacterEscape :: + CharacterEscape ::! ControlEscape `c` ControlLetter `0` [lookahead ∉ DecimalDigit] HexEscapeSequence RegExpUnicodeEscapeSequence + LegacyOctalEscapeSequence IdentityEscape @@ -46520,9 +46633,13 @@

          Regular Expressions

          + + + + @@ -46539,6 +46656,7 @@

          Regular Expressions

          + @@ -46558,6 +46676,7 @@

          Regular Expressions

          + @@ -46579,252 +46698,7 @@

          HTML-like Comments

          Regular Expressions Patterns

          -

          The syntax of is modified and extended as follows. These changes introduce ambiguities that are broken by the ordering of grammar productions and by contextual information. When parsing using the following grammar, each alternative is considered only if previous production alternatives do not match.

          -

          This alternative pattern grammar and semantics only changes the syntax and semantics of BMP patterns. The following grammar extensions include productions parameterized with the [UnicodeMode] parameter. However, none of these extensions change the syntax of Unicode patterns recognized when parsing with the [UnicodeMode] parameter present on the goal symbol.

          -

          Syntax

          -
          - Term[UnicodeMode, N] ::
          -   [+UnicodeMode] Assertion[+UnicodeMode, ?N]
          -   [+UnicodeMode] Atom[+UnicodeMode, ?N] Quantifier
          -   [+UnicodeMode] Atom[+UnicodeMode, ?N]
          -   [~UnicodeMode] QuantifiableAssertion[?N] Quantifier
          -   [~UnicodeMode] Assertion[~UnicodeMode, ?N]
          -   [~UnicodeMode] ExtendedAtom[?N] Quantifier
          -   [~UnicodeMode] ExtendedAtom[?N]
          -
          - Assertion[UnicodeMode, N] ::
          -   `^`
          -   `$`
          -   `\` `b`
          -   `\` `B`
          -   [+UnicodeMode] `(` `?` `=` Disjunction[+UnicodeMode, ?N] `)`
          -   [+UnicodeMode] `(` `?` `!` Disjunction[+UnicodeMode, ?N] `)`
          -   [~UnicodeMode] QuantifiableAssertion[?N]
          -   `(` `?` `<=` Disjunction[?UnicodeMode, ?N] `)`
          -   `(` `?` `<!` Disjunction[?UnicodeMode, ?N] `)`
          -
          - QuantifiableAssertion[N] ::
          -   `(` `?` `=` Disjunction[~UnicodeMode, ?N] `)`
          -   `(` `?` `!` Disjunction[~UnicodeMode, ?N] `)`
          -
          - ExtendedAtom[N] ::
          -   `.`
          -   `\` AtomEscape[~UnicodeMode, ?N]
          -   `\` [lookahead == `c`]
          -   CharacterClass[~UnicodeMode]
          -   `(` Disjunction[~UnicodeMode, ?N] `)`
          -   `(` `?` `:` Disjunction[~UnicodeMode, ?N] `)`
          -   InvalidBracedQuantifier
          -   ExtendedPatternCharacter
          -
          - InvalidBracedQuantifier ::
          -   `{` DecimalDigits[~Sep] `}`
          -   `{` DecimalDigits[~Sep] `,` `}`
          -   `{` DecimalDigits[~Sep] `,` DecimalDigits[~Sep] `}`
          -
          - ExtendedPatternCharacter ::
          -   SourceCharacter but not one of `^` `$` `\` `.` `*` `+` `?` `(` `)` `[` `|`
          -
          - AtomEscape[UnicodeMode, N] ::
          -   [+UnicodeMode] DecimalEscape
          -   [~UnicodeMode] DecimalEscape [> but only if the CapturingGroupNumber of |DecimalEscape| is ≤ _NcapturingParens_]
          -   CharacterClassEscape[?UnicodeMode]
          -   CharacterEscape[?UnicodeMode, ?N]
          -   [+N] `k` GroupName[?UnicodeMode]
          -
          - CharacterEscape[UnicodeMode, N] ::
          -   ControlEscape
          -   `c` ControlLetter
          -   `0` [lookahead ∉ DecimalDigit]
          -   HexEscapeSequence
          -   RegExpUnicodeEscapeSequence[?UnicodeMode]
          -   [~UnicodeMode] LegacyOctalEscapeSequence
          -   IdentityEscape[?UnicodeMode, ?N]
          -
          - IdentityEscape[UnicodeMode, N] ::
          -   [+UnicodeMode] SyntaxCharacter
          -   [+UnicodeMode] `/`
          -   [~UnicodeMode] SourceCharacterIdentityEscape[?N]
          -
          - SourceCharacterIdentityEscape[N] ::
          -   [~N] SourceCharacter but not `c`
          -   [+N] SourceCharacter but not one of `c` or `k`
          -
          - ClassAtomNoDash[UnicodeMode, N] ::
          -   SourceCharacter but not one of `\` or `]` or `-`
          -   `\` ClassEscape[?UnicodeMode, ?N]
          -   `\` [lookahead == `c`]
          -
          - ClassEscape[UnicodeMode, N] ::
          -   `b`
          -   [+UnicodeMode] `-`
          -   [~UnicodeMode] `c` ClassControlLetter
          -   CharacterClassEscape[?UnicodeMode]
          -   CharacterEscape[?UnicodeMode, ?N]
          -
          - ClassControlLetter ::
          -   DecimalDigit
          -   `_`
          -
          -

          When the same left-hand sides occurs with both [+UnicodeMode] and [\~UnicodeMode] guards it is to control the disambiguation priority.

          -
          - - -

          Static Semantics: Early Errors

          -

          The semantics of is extended as follows:

          - ExtendedAtom :: InvalidBracedQuantifier -
            -
          • - It is a Syntax Error if any source text matches this rule. -
          • -
          -

          Additionally, the rules for the following productions are modified with the addition of the highlighted text:

          - NonemptyClassRanges :: ClassAtom `-` ClassAtom ClassRanges -
            -
          • - It is a Syntax Error if IsCharacterClass of the first |ClassAtom| is *true* or IsCharacterClass of the second |ClassAtom| is *true* and this production has a [UnicodeMode] parameter. -
          • -
          • - It is a Syntax Error if IsCharacterClass of the first |ClassAtom| is *false* and IsCharacterClass of the second |ClassAtom| is *false* and the CharacterValue of the first |ClassAtom| is larger than the CharacterValue of the second |ClassAtom|. -
          • -
          - NonemptyClassRangesNoDash :: ClassAtomNoDash `-` ClassAtom ClassRanges -
            -
          • - It is a Syntax Error if IsCharacterClass of |ClassAtomNoDash| is *true* or IsCharacterClass of |ClassAtom| is *true* and this production has a [UnicodeMode] parameter. -
          • -
          • - It is a Syntax Error if IsCharacterClass of |ClassAtomNoDash| is *false* and IsCharacterClass of |ClassAtom| is *false* and the CharacterValue of |ClassAtomNoDash| is larger than the CharacterValue of |ClassAtom|. -
          • -
          -
          - - -

          Static Semantics: IsCharacterClass

          -

          The semantics of is extended as follows:

          - - ClassAtomNoDash :: `\` [lookahead == `c`] - - - 1. Return *false*. - -
          - - -

          Static Semantics: CharacterValue

          -

          The semantics of is extended as follows:

          -
          - ClassAtomNoDash :: `\` [lookahead == `c`]
          -
          -
          - 1. Return the code point value of U+005C (REVERSE SOLIDUS).
          -
          - ClassEscape :: `c` ClassControlLetter
          -
          - 1. Let _ch_ be the code point matched by |ClassControlLetter|.
          - 1. Let _i_ be _ch_'s code point value.
          - 1. Return the remainder of dividing _i_ by 32.
          -
          - CharacterEscape :: LegacyOctalEscapeSequence
          -
          - 1. Return the MV of |LegacyOctalEscapeSequence| (see ).
          -
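
          The CharacterValue rules above can be checked with a short sketch. It assumes a host that implements these legacy escapes for patterns without the u flag; `classControlValue` is a hypothetical helper that mirrors the "remainder of dividing by 32" step and is not part of any API.

```javascript
// Mirrors the CharacterValue steps for ClassEscape :: `c` ClassControlLetter.
function classControlValue(letter) {
  const i = letter.codePointAt(0); // the code point value of the ClassControlLetter
  return i % 32;                   // the remainder of dividing i by 32
}

// "1" is U+0031 (49) and "_" is U+005F (95).
const digitValue = classControlValue("1");
const underscoreValue = classControlValue("_");

// Inside a class, \c1 and \c_ should denote the characters with those values.
const digitEscapeMatches = /^[\c1]$/.test(String.fromCodePoint(digitValue));
const underscoreEscapeMatches = /^[\c_]$/.test(String.fromCodePoint(underscoreValue));

// \101 is a LegacyOctalEscapeSequence with MV 65 (octal 101), i.e. "A",
// because the pattern has no capturing group numbered 101.
const octalMatches = /^\101$/.test("A");

console.log({ digitValue, underscoreValue, digitEscapeMatches, underscoreEscapeMatches, octalMatches });
```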
          - - -

          Pattern Semantics

          -

          The semantics of is extended as follows:

          -

          Within reference to “Atom :: `(` GroupSpecifier Disjunction `)` ” are to be interpreted as meaning “Atom :: `(` GroupSpecifier Disjunction `)` ” or “ExtendedAtom :: `(` Disjunction `)` ”.

          - -

          Term () includes the following additional evaluation rules:

          -

          The production Term :: QuantifiableAssertion Quantifier evaluates the same as the production Term :: Atom Quantifier but with |QuantifiableAssertion| substituted for |Atom|.

          -

          The production Term :: ExtendedAtom Quantifier evaluates the same as the production Term :: Atom Quantifier but with |ExtendedAtom| substituted for |Atom|.

          -

          The production Term :: ExtendedAtom evaluates the same as the production Term :: Atom but with |ExtendedAtom| substituted for |Atom|.

          - -

          Assertion () includes the following additional evaluation rule:

          -

          The production Assertion :: QuantifiableAssertion evaluates as follows:

          - - 1. Evaluate |QuantifiableAssertion| to obtain a Matcher _m_. - 1. Return _m_. - - -

          Assertion () evaluation rules for the Assertion :: `(` `?` `=` Disjunction `)` and Assertion :: `(` `?` `!` Disjunction `)` productions are also used for the |QuantifiableAssertion| productions, but with |QuantifiableAssertion| substituted for |Assertion|.

          - -

          Atom () evaluation rules for the |Atom| productions except for Atom :: PatternCharacter are also used for the |ExtendedAtom| productions, but with |ExtendedAtom| substituted for |Atom|. The following evaluation rules, with parameter _direction_, are also added:

          -

          The production ExtendedAtom :: `\` [lookahead == `c`] evaluates as follows:

          - - 1. Let _A_ be the CharSet containing the single character `\\` U+005C (REVERSE SOLIDUS). - 1. Return ! CharacterSetMatcher(_A_, *false*, _direction_). - -

          The production ExtendedAtom :: ExtendedPatternCharacter evaluates as follows:

          - - 1. Let _ch_ be the character represented by |ExtendedPatternCharacter|. - 1. Let _A_ be a one-element CharSet containing the character _ch_. - 1. Return ! CharacterSetMatcher(_A_, *false*, _direction_). - - -

          CharacterEscape () includes the following additional evaluation rule:

          -

          The production CharacterEscape :: LegacyOctalEscapeSequence evaluates as follows:

          - - 1. Let _cv_ be the CharacterValue of this |CharacterEscape|. - 1. Return the character whose character value is _cv_. - - -

          NonemptyClassRanges () modifies the following evaluation rule:

          -

          The production NonemptyClassRanges :: ClassAtom `-` ClassAtom ClassRanges evaluates as follows:

          - - 1. Evaluate the first |ClassAtom| to obtain a CharSet _A_. - 1. Evaluate the second |ClassAtom| to obtain a CharSet _B_. - 1. Evaluate |ClassRanges| to obtain a CharSet _C_. - 1. Let _D_ be ! CharacterRangeOrUnion(_A_, _B_). - 1. Return the union of _D_ and _C_. - - -

          NonemptyClassRangesNoDash () modifies the following evaluation rule:

          -

          The production NonemptyClassRangesNoDash :: ClassAtomNoDash `-` ClassAtom ClassRanges evaluates as follows:

          - - 1. Evaluate |ClassAtomNoDash| to obtain a CharSet _A_. - 1. Evaluate |ClassAtom| to obtain a CharSet _B_. - 1. Evaluate |ClassRanges| to obtain a CharSet _C_. - 1. Let _D_ be ! CharacterRangeOrUnion(_A_, _B_). - 1. Return the union of _D_ and _C_. - - -

          ClassEscape () includes the following additional evaluation rule:

          -

          The production ClassEscape :: `c` ClassControlLetter evaluates as follows:

          - - 1. Let _cv_ be the CharacterValue of this |ClassEscape|. - 1. Let _c_ be the character whose character value is _cv_. - 1. Return the CharSet containing the single character _c_. - - -

          ClassAtomNoDash () includes the following additional evaluation rule:

          -

          The production ClassAtomNoDash :: `\` [lookahead == `c`] evaluates as follows:

          - - 1. Return the CharSet containing the single character `\\` U+005C (REVERSE SOLIDUS). - - - This production can only be reached from the sequence `\c` within a character class where it is not followed by an acceptable control character. - - -

          - CharacterRangeOrUnion ( - _A_: a CharSet, - _B_: a CharSet, - ) -

          -
          -
          -
          - 1. If _Unicode_ is *false*, then
          -   1. If _A_ does not contain exactly one character or _B_ does not contain exactly one character, then
          -     1. Let _C_ be the CharSet containing the single character `-` U+002D (HYPHEN-MINUS).
          -     1. Return the union of CharSets _A_, _B_ and _C_.
          - 1. Return ! CharacterRange(_A_, _B_).
          -
          -
          +

          Some of the syntax and semantics of BMP patterns ([~UnicodeMode]) used to be normative optional.

          From b5169cf0eb890ae9f89508a1b5b90f2f074034b4 Mon Sep 17 00:00:00 2001 From: Michael Dyck Date: Tue, 20 Aug 2019 15:10:44 -0400 Subject: [PATCH 06/12] Editorial: Simplify the 'Assertion' production ... by using 'QuantifiableAssertion' under [+U] too. (This involves adding the [U] parameter to QuantifiableAssertion.) --- spec.html | 18 +++++++----------- 1 file changed, 7 insertions(+), 11 deletions(-) diff --git a/spec.html b/spec.html index 56ec0a727c..8496c052e9 100644 --- a/spec.html +++ b/spec.html @@ -34261,7 +34261,7 @@

          Patterns

          [+UnicodeMode] Assertion[+UnicodeMode, ?N] [+UnicodeMode] Atom[+UnicodeMode, ?N] Quantifier [+UnicodeMode] Atom[+UnicodeMode, ?N] - [~UnicodeMode] QuantifiableAssertion[?N] Quantifier + [~UnicodeMode] QuantifiableAssertion[~UnicodeMode, ?N] Quantifier [~UnicodeMode] Assertion[~UnicodeMode, ?N] [~UnicodeMode] ExtendedAtom[?N] Quantifier [~UnicodeMode] ExtendedAtom[?N] @@ -34271,15 +34271,13 @@

          Patterns

          `$` `\` `b` `\` `B` - [+UnicodeMode] `(` `?` `=` Disjunction[+UnicodeMode, ?N] `)` - [+UnicodeMode] `(` `?` `!` Disjunction[+UnicodeMode, ?N] `)` - [~UnicodeMode] QuantifiableAssertion[?N] + QuantifiableAssertion[?UnicodeMode, ?N] `(` `?` `<=` Disjunction[?UnicodeMode, ?N] `)` `(` `?` `<!` Disjunction[?UnicodeMode, ?N] `)` - QuantifiableAssertion[N] :: - `(` `?` `=` Disjunction[~UnicodeMode, ?N] `)` - `(` `?` `!` Disjunction[~UnicodeMode, ?N] `)` + QuantifiableAssertion[UnicodeMode, N] :: + `(` `?` `=` Disjunction[?UnicodeMode, ?N] `)` + `(` `?` `!` Disjunction[?UnicodeMode, ?N] `)` Quantifier :: QuantifierPrefix @@ -35240,7 +35238,7 @@

          Assertion

          1. If _a_ is *true* and _b_ is *true*, or if _a_ is *false* and _b_ is *false*, return _c_(_x_). 1. Return ~failure~.
          -

          The production Assertion :: `(` `?` `=` Disjunction `)` evaluates as follows:

          +

          The production QuantifiableAssertion :: `(` `?` `=` Disjunction `)` evaluates as follows:

          1. Evaluate |Disjunction| with 1 as its _direction_ argument to obtain a Matcher _m_. 1. Return a new Matcher with parameters (_x_, _c_) that captures _m_ and performs the following steps when called: @@ -35257,7 +35255,7 @@

          Assertion

          1. Let _z_ be the State (_xe_, _cap_). 1. Return _c_(_z_).
          -

          The production Assertion :: `(` `?` `!` Disjunction `)` evaluates as follows:

          +

          The production QuantifiableAssertion :: `(` `?` `!` Disjunction `)` evaluates as follows:

          1. Evaluate |Disjunction| with 1 as its _direction_ argument to obtain a Matcher _m_. 1. Return a new Matcher with parameters (_x_, _c_) that captures _m_ and performs the following steps when called: @@ -35305,8 +35303,6 @@

          Assertion

          1. If _r_ is not ~failure~, return ~failure~. 1. Return _c_(_x_).
          -

          ----

          -

          The evaluation rules for the Assertion :: `(` `?` `=` Disjunction `)` and Assertion :: `(` `?` `!` Disjunction `)` productions are also used for the |QuantifiableAssertion| productions, but with |QuantifiableAssertion| substituted for |Assertion|.

          From c9d5ae34ea790adb095203f09fd04779178a6ecf Mon Sep 17 00:00:00 2001 From: Michael Dyck Date: Tue, 20 Aug 2019 15:14:19 -0400 Subject: [PATCH 07/12] Editorial: Move the runtime semantics for QuantifiableAssertion ... down to its proper place in production order. (This commit's diff probably shows a complicated combination of tweaks, but it's really just taking a block of 22 lines and shifting it down the file.) --- spec.html | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/spec.html b/spec.html index 8496c052e9..ea683c3be1 100644 --- a/spec.html +++ b/spec.html @@ -35238,9 +35238,14 @@

          Assertion

          1. If _a_ is *true* and _b_ is *true*, or if _a_ is *false* and _b_ is *false*, return _c_(_x_). 1. Return ~failure~. -

          The production QuantifiableAssertion :: `(` `?` `=` Disjunction `)` evaluates as follows:

          +

          The production Assertion :: QuantifiableAssertion evaluates as follows:

          - 1. Evaluate |Disjunction| with 1 as its _direction_ argument to obtain a Matcher _m_. + 1. Evaluate |QuantifiableAssertion| to obtain a Matcher _m_. + 1. Return _m_. + +

          The production Assertion :: `(` `?` `<=` Disjunction `)` evaluates as follows:

          + + 1. Evaluate |Disjunction| with -1 as its _direction_ argument to obtain a Matcher _m_. 1. Return a new Matcher with parameters (_x_, _c_) that captures _m_ and performs the following steps when called: 1. Assert: _x_ is a State. 1. Assert: _c_ is a Continuation. @@ -35255,9 +35260,9 @@

          Assertion

          1. Let _z_ be the State (_xe_, _cap_). 1. Return _c_(_z_).
          -

          The production QuantifiableAssertion :: `(` `?` `!` Disjunction `)` evaluates as follows:

          +

          The production Assertion :: `(` `?` `<!` Disjunction `)` evaluates as follows:

          - 1. Evaluate |Disjunction| with 1 as its _direction_ argument to obtain a Matcher _m_. + 1. Evaluate |Disjunction| with -1 as its _direction_ argument to obtain a Matcher _m_. 1. Return a new Matcher with parameters (_x_, _c_) that captures _m_ and performs the following steps when called: 1. Assert: _x_ is a State. 1. Assert: _c_ is a Continuation. @@ -35268,14 +35273,9 @@

          Assertion

          1. If _r_ is not ~failure~, return ~failure~. 1. Return _c_(_x_).
          -

          The production Assertion :: QuantifiableAssertion evaluates as follows:

          - - 1. Evaluate |QuantifiableAssertion| to obtain a Matcher _m_. - 1. Return _m_. - -

          The production Assertion :: `(` `?` `<=` Disjunction `)` evaluates as follows:

          +

          The production QuantifiableAssertion :: `(` `?` `=` Disjunction `)` evaluates as follows:

          - 1. Evaluate |Disjunction| with -1 as its _direction_ argument to obtain a Matcher _m_. + 1. Evaluate |Disjunction| with 1 as its _direction_ argument to obtain a Matcher _m_. 1. Return a new Matcher with parameters (_x_, _c_) that captures _m_ and performs the following steps when called: 1. Assert: _x_ is a State. 1. Assert: _c_ is a Continuation. @@ -35290,9 +35290,9 @@

          Assertion

          1. Let _z_ be the State (_xe_, _cap_). 1. Return _c_(_z_).
          -

          The production Assertion :: `(` `?` `<!` Disjunction `)` evaluates as follows:

          +

          The production QuantifiableAssertion :: `(` `?` `!` Disjunction `)` evaluates as follows:

          - 1. Evaluate |Disjunction| with -1 as its _direction_ argument to obtain a Matcher _m_. + 1. Evaluate |Disjunction| with 1 as its _direction_ argument to obtain a Matcher _m_. 1. Return a new Matcher with parameters (_x_, _c_) that captures _m_ and performs the following steps when called: 1. Assert: _x_ is a State. 1. Assert: _c_ is a Continuation. From 74891c8be0fc249351243816d628196dccbd2d12 Mon Sep 17 00:00:00 2001 From: Michael Dyck Date: Tue, 20 Aug 2019 15:48:03 -0400 Subject: [PATCH 08/12] Editorial: Merge 'ExtendedAtom' and 'Atom' Note that: ExtendedAtom was only ever 'invoked' under [~U], and when the merged production is invoked with [~U], it exactly reproduces the RHSs of ExtendedAtom. Atom was only ever invoked under [+U], and when the merged production is invoked with [+U], it reproduces the RHSs of former Atom except for the placement of the PatternCharacter RHS, but that's okay, because former Atom wasn't an order-disambiguated production. --- spec.html | 71 +++++++++++++++++++++++-------------------------------- 1 file changed, 30 insertions(+), 41 deletions(-) diff --git a/spec.html b/spec.html index ea683c3be1..d07031d993 100644 --- a/spec.html +++ b/spec.html @@ -34263,8 +34263,8 @@

          Patterns

          [+UnicodeMode] Atom[+UnicodeMode, ?N] [~UnicodeMode] QuantifiableAssertion[~UnicodeMode, ?N] Quantifier [~UnicodeMode] Assertion[~UnicodeMode, ?N] - [~UnicodeMode] ExtendedAtom[?N] Quantifier - [~UnicodeMode] ExtendedAtom[?N] + [~UnicodeMode] Atom[~UnicodeMode, ?N] Quantifier + [~UnicodeMode] Atom[~UnicodeMode, ?N] Assertion[UnicodeMode, N] :: `^` @@ -34291,23 +34291,17 @@

          Patterns

          `{` DecimalDigits[~Sep] `,` `}` `{` DecimalDigits[~Sep] `,` DecimalDigits[~Sep] `}` - ExtendedAtom[N] ::! - `.` - `\` AtomEscape[~UnicodeMode, ?N] - `\` [lookahead == `c`] - CharacterClass[~UnicodeMode] - `(` Disjunction[~UnicodeMode, ?N] `)` - `(` `?` `:` Disjunction[~UnicodeMode, ?N] `)` - InvalidBracedQuantifier - ExtendedPatternCharacter - - Atom[UnicodeMode, N] :: - PatternCharacter + Atom[UnicodeMode, N] ::! `.` `\` AtomEscape[?UnicodeMode, ?N] + [~UnicodeMode] `\` [lookahead == `c`] CharacterClass[?UnicodeMode] - `(` GroupSpecifier[?UnicodeMode] Disjunction[?UnicodeMode, ?N] `)` + [+UnicodeMode] `(` GroupSpecifier[?UnicodeMode] Disjunction[?UnicodeMode, ?N] `)` + [~UnicodeMode] `(` Disjunction[?UnicodeMode, ?N] `)` `(` `?` `:` Disjunction[?UnicodeMode, ?N] `)` + [~UnicodeMode] InvalidBracedQuantifier + [+UnicodeMode] PatternCharacter + [~UnicodeMode] ExtendedPatternCharacter InvalidBracedQuantifier :: `{` DecimalDigits[~Sep] `}` @@ -34493,7 +34487,7 @@

          Escapes

          Term ::! QuantifiableAssertion Quantifier - ExtendedAtom ::! `\` [lookahead == `c`] + Atom ::! `\` [lookahead == `c`] ClassAtomNoDash ::! `\` [lookahead == `c`] @@ -34524,7 +34518,7 @@

          Static Semantics: Early Errors

          It is a Syntax Error if the MV of the first |DecimalDigits| is larger than the MV of the second |DecimalDigits|.

        - ExtendedAtom ::! InvalidBracedQuantifier + Atom ::! InvalidBracedQuantifier
        • It is a Syntax Error if any source text matches this rule. @@ -34950,7 +34944,7 @@

          Notation

          _InputLength_ is the number of characters in _Input_.
        • - _NcapturingParens_ is the total number of left-capturing parentheses (i.e. the total number of Atom :: `(` GroupSpecifier Disjunction `)` Parse Nodes) in the pattern. A left-capturing parenthesis is any `(` pattern character that is matched by the `(` terminal of the Atom :: `(` GroupSpecifier Disjunction `)` production. + _NcapturingParens_ is the total number of left-capturing parentheses (i.e. the total number of Atom ::! `(` GroupSpecifier Disjunction `)` Parse Nodes) in the pattern. A left-capturing parenthesis is any `(` pattern character that is matched by the `(` terminal of the Atom ::! `(` GroupSpecifier Disjunction `)` production.
        • _DotAll_ is *true* if the RegExp object's [[OriginalFlags]] internal slot contains *"s"* and otherwise is *false*. @@ -35099,18 +35093,16 @@

          Term

          1. Evaluate |Atom| with argument _direction_ to obtain a Matcher _m_. 1. Evaluate |Quantifier| to obtain the three results: a non-negative integer _min_, a non-negative integer (or +∞) _max_, and Boolean _greedy_. 1. Assert: _min_ ≤ _max_. - 1. Let _parenIndex_ be the number of left-capturing parentheses in the entire regular expression that occur to the left of this |Term|. This is the total number of Atom :: `(` GroupSpecifier Disjunction `)` Parse Nodes prior to or enclosing this |Term|. - 1. Let _parenCount_ be the number of left-capturing parentheses in |Atom|. This is the total number of Atom :: `(` GroupSpecifier Disjunction `)` Parse Nodes enclosed by |Atom|. + 1. Let _parenIndex_ be the number of left-capturing parentheses in the entire regular expression that occur to the left of this |Term|. This is the total number of Atom ::! `(` GroupSpecifier Disjunction `)` Parse Nodes prior to or enclosing this |Term|. + 1. Let _parenCount_ be the number of left-capturing parentheses in |Atom|. This is the total number of Atom ::! `(` GroupSpecifier Disjunction `)` Parse Nodes enclosed by |Atom|. 1. Return a new Matcher with parameters (_x_, _c_) that captures _m_, _min_, _max_, _greedy_, _parenIndex_, and _parenCount_ and performs the following steps when called: 1. Assert: _x_ is a State. 1. Assert: _c_ is a Continuation. 1. Return ! RepeatMatcher(_m_, _min_, _max_, _greedy_, _x_, _c_, _parenIndex_, _parenCount_).

          ----

          -

          In the above algorithm, references to Atom :: `(` GroupSpecifier Disjunction `)` are to be interpreted as meaning Atom :: `(` GroupSpecifier Disjunction `)` or ExtendedAtom ::! `(` Disjunction `)` .

          +

          In the above algorithm, references to Atom ::! `(` GroupSpecifier Disjunction `)` are to be interpreted as meaning Atom ::! `(` GroupSpecifier Disjunction `)` or Atom ::! `(` Disjunction `)` .

          The production Term ::! QuantifiableAssertion Quantifier evaluates the same as the production Term ::! Atom Quantifier but with |QuantifiableAssertion| substituted for |Atom|.

          -

          The production Term ::! ExtendedAtom Quantifier evaluates the same as the production Term ::! Atom Quantifier but with |ExtendedAtom| substituted for |Atom|.

          -

          The production Term ::! ExtendedAtom evaluates the same as the production Term ::! Atom but with |ExtendedAtom| substituted for |Atom|.

          @@ -35365,37 +35357,31 @@

          Quantifier

          Atom

          With parameter _direction_.

          -

          The production Atom :: PatternCharacter evaluates as follows:

          - - 1. Let _ch_ be the character matched by |PatternCharacter|. - 1. Let _A_ be a one-element CharSet containing the character _ch_. - 1. Return ! CharacterSetMatcher(_A_, *false*, _direction_). - -

          The production Atom :: `.` evaluates as follows:

          +

          The production Atom ::! `.` evaluates as follows:

          1. Let _A_ be the CharSet of all characters. 1. If _DotAll_ is not *true*, then 1. Remove from _A_ all characters corresponding to a code point on the right-hand side of the |LineTerminator| production. 1. Return ! CharacterSetMatcher(_A_, *false*, _direction_). -

          The production Atom :: `\` AtomEscape evaluates as follows:

          +

          The production Atom ::! `\` AtomEscape evaluates as follows:

          1. Return the Matcher that is the result of evaluating |AtomEscape| with argument _direction_. -

          The production ExtendedAtom ::! `\` [lookahead == `c`] evaluates as follows:

          +

          The production Atom ::! `\` [lookahead == `c`] evaluates as follows:

           1. Let _A_ be the CharSet containing the single character `\\` U+005C (REVERSE SOLIDUS).
           1. Return ! CharacterSetMatcher(_A_, *false*, _direction_).
          -

          The production Atom :: CharacterClass evaluates as follows:

          +

          The production Atom ::! CharacterClass evaluates as follows:

           1. Evaluate |CharacterClass| to obtain a CharSet _A_ and a Boolean _invert_.
           1. Return ! CharacterSetMatcher(_A_, _invert_, _direction_).
          -

          The production Atom :: `(` GroupSpecifier Disjunction `)` evaluates as follows:

          +

          The production Atom ::! `(` GroupSpecifier Disjunction `)` evaluates as follows:

           1. Evaluate |Disjunction| with argument _direction_ to obtain a Matcher _m_.
          - 1. Let _parenIndex_ be the number of left-capturing parentheses in the entire regular expression that occur to the left of this |Atom|. This is the total number of Atom :: `(` GroupSpecifier Disjunction `)` Parse Nodes prior to or enclosing this |Atom|.
          + 1. Let _parenIndex_ be the number of left-capturing parentheses in the entire regular expression that occur to the left of this |Atom|. This is the total number of Atom ::! `(` GroupSpecifier Disjunction `)` Parse Nodes prior to or enclosing this |Atom|.
           1. Return a new Matcher with parameters (_x_, _c_) that captures _direction_, _m_, and _parenIndex_ and performs the following steps when called:
             1. Assert: _x_ is a State.
             1. Assert: _c_ is a Continuation.
          @@ -35416,18 +35402,22 @@

          Atom

          1. Return _c_(_z_). 1. Return _m_(_x_, _d_).
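The _parenIndex_ step above counts only capturing groups, in order of their opening parenthesis. The following is a minimal sketch of that numbering as observed from script; the `captures` helper is a hypothetical name introduced for illustration and is not part of the specification.

```javascript
// Hypothetical helper (not from the spec): returns the captured substrings of
// a match, dropping index 0 (the whole match), or null if there is no match.
function captures(regex, input) {
  const match = regex.exec(input);
  return match === null ? null : Array.from(match).slice(1);
}

// `(?:b)` is a non-capturing group, so it does not advance the group count.
const flat = captures(/(a)(?:b)(c)/, "abc");

// Nested groups are numbered by the position of their opening parenthesis,
// so the enclosing group comes before the groups it contains.
const nested = captures(/((a)(b))(c)/, "abc");

console.log(flat);   // ['a', 'c']
console.log(nested); // ['ab', 'a', 'b', 'c']
```

A group's number as seen from script is its _parenIndex_ plus one, since index 0 of the match result holds the whole match.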
          -

          The production Atom :: `(` `?` `:` Disjunction `)` evaluates as follows:

          +

          The production Atom ::! `(` `?` `:` Disjunction `)` evaluates as follows:

          1. Return the Matcher that is the result of evaluating |Disjunction| with argument _direction_. -

          The production ExtendedAtom ::! ExtendedPatternCharacter evaluates as follows:

          +

          The production Atom ::! ExtendedPatternCharacter evaluates as follows:

           1. Let _ch_ be the character represented by |ExtendedPatternCharacter|.
           1. Let _A_ be a one-element CharSet containing the character _ch_.
           1. Return ! CharacterSetMatcher(_A_, *false*, _direction_).
          -

          ----

          -

          The evaluation rules for the |Atom| productions except for Atom :: PatternCharacter are also used for the |ExtendedAtom| productions, but with |ExtendedAtom| substituted for |Atom|.

          +

          The production Atom ::! PatternCharacter evaluates as follows:

          +
          + 1. Let _ch_ be the character matched by |PatternCharacter|.
          + 1. Let _A_ be a one-element CharSet containing the character _ch_.
          + 1. Return ! CharacterSetMatcher(_A_, *false*, _direction_).
          +
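Several of the Atom rules in this hunk (`.`, `\` [lookahead == `c`], and the pattern-character alternatives) reduce to a CharacterSetMatcher over a small CharSet. The sketch below probes their observable behaviour. It assumes an engine that implements the legacy non-Unicode-mode grammar (web browsers and Node.js do), and `compiles` is a hypothetical helper name.

```javascript
// Hypothetical helper: true if `source` is accepted as a pattern under `flags`.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// Atom `.`: line terminators are removed from the CharSet unless DotAll is true.
const dotPlain = /^.$/.test("\n");      // false
const dotAll = /^.$/s.test("\n");       // true
const dotLineSep = /^.$/.test("\u2028"); // false (U+2028 is a LineTerminator)

// Atom `\` [lookahead == `c`]: outside Unicode mode, a `\c` that is not
// followed by a control letter matches a literal backslash, then a literal `c`.
const slashC = /^\c$/.test("\\c");       // true
const slashCUnicode = compiles("\\c", "u"); // false

// Pattern characters: outside Unicode mode `]` and `}` are ordinary
// characters; in Unicode mode they are SyntaxCharacters and must be escaped.
const braceLegacy = /^a}$/.test("a}");   // true
const bracketLegacy = /^]$/.test("]");   // true
const braceUnicode = compiles("a}", "u"); // false
```

The Unicode-mode cases go through the `RegExp` constructor so that the expected SyntaxError is raised at run time instead of when the script is parsed.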

          @@ -35757,7 +35747,7 @@

          AtomEscape

           1. Search the enclosing |Pattern| for an instance of a |GroupSpecifier| containing a |RegExpIdentifierName| which has a CapturingGroupName equal to the CapturingGroupName of the |RegExpIdentifierName| contained in |GroupName|.
           1. Assert: A unique such |GroupSpecifier| is found.
          - 1. Let _parenIndex_ be the number of left-capturing parentheses in the entire regular expression that occur to the left of the located |GroupSpecifier|. This is the total number of Atom :: `(` GroupSpecifier Disjunction `)` Parse Nodes prior to or enclosing the located |GroupSpecifier|, including its immediately enclosing |Atom|.
          + 1. Let _parenIndex_ be the number of left-capturing parentheses in the entire regular expression that occur to the left of the located |GroupSpecifier|. This is the total number of Atom ::! `(` GroupSpecifier Disjunction `)` Parse Nodes prior to or enclosing the located |GroupSpecifier|, including its immediately enclosing |Atom|.
           1. Return ! BackreferenceMatcher(_parenIndex_, _direction_).
          @@ -46632,7 +46622,6 @@

          Regular Expressions

          -
          From feb834670c16e6d04a87ddce7a5b4734142f1199 Mon Sep 17 00:00:00 2001
          From: Michael Dyck
          Date: Tue, 20 Aug 2019 16:19:01 -0400
          Subject: [PATCH 09/12] Editorial: Simplify the 'Term' production

          ... by merging three pairs of RHSs.

          (Preserves the order of alternatives under [~U], but not under [+U],
          but that's okay, because the [+U] sides aren't order-disambiguated.)
          ---
           spec.html | 9 +++------
           1 file changed, 3 insertions(+), 6 deletions(-)

          diff --git a/spec.html b/spec.html
          index d07031d993..fbedbfd6b4 100644
          --- a/spec.html
          +++ b/spec.html
          @@ -34258,13 +34258,10 @@

          Patterns

             Alternative[?UnicodeMode, ?N] Term[?UnicodeMode, ?N]

           Term[UnicodeMode, N] ::!
          -  [+UnicodeMode] Assertion[+UnicodeMode, ?N]
          -  [+UnicodeMode] Atom[+UnicodeMode, ?N] Quantifier
          -  [+UnicodeMode] Atom[+UnicodeMode, ?N]
             [~UnicodeMode] QuantifiableAssertion[~UnicodeMode, ?N] Quantifier
          -  [~UnicodeMode] Assertion[~UnicodeMode, ?N]
          -  [~UnicodeMode] Atom[~UnicodeMode, ?N] Quantifier
          -  [~UnicodeMode] Atom[~UnicodeMode, ?N]
          +  Assertion[?UnicodeMode, ?N]
          +  Atom[?UnicodeMode, ?N] Quantifier
          +  Atom[?UnicodeMode, ?N]

           Assertion[UnicodeMode, N] ::
             `^`

          From 6f66976e3e37651a9e301de8fe5ae8ca9d158529 Mon Sep 17 00:00:00 2001
          From: Michael Dyck
          Date: Tue, 20 Aug 2019 20:30:44 -0400
          Subject: [PATCH 10/12] Editorial: Simplify the 'Atom' production

          ... by merging the two capturing-group alternatives.

          (This may be affected by the outcome of issue #1673.)
          ---
           spec.html | 7 ++-----
           1 file changed, 2 insertions(+), 5 deletions(-)

          diff --git a/spec.html b/spec.html
          index fbedbfd6b4..51a0802a9e 100644
          --- a/spec.html
          +++ b/spec.html
          @@ -34293,8 +34293,7 @@

          Patterns

             `\` AtomEscape[?UnicodeMode, ?N]
             [~UnicodeMode] `\` [lookahead == `c`]
             CharacterClass[?UnicodeMode]
          -  [+UnicodeMode] `(` GroupSpecifier[?UnicodeMode] Disjunction[?UnicodeMode, ?N] `)`
          -  [~UnicodeMode] `(` Disjunction[?UnicodeMode, ?N] `)`
          +  `(` GroupSpecifier[?UnicodeMode] Disjunction[?UnicodeMode, ?N] `)`
             `(` `?` `:` Disjunction[?UnicodeMode, ?N] `)`
             [~UnicodeMode] InvalidBracedQuantifier
             [+UnicodeMode] PatternCharacter
          @@ -34319,7 +34318,7 @@

          Group Specifiers

           GroupSpecifier[UnicodeMode] ::
             [empty]
          -  `?` GroupName[?UnicodeMode]
          +  [+UnicodeMode] `?` GroupName[?UnicodeMode]

           GroupName[UnicodeMode] ::
             `<` RegExpIdentifierName[?UnicodeMode] `>`
          @@ -35097,8 +35096,6 @@

          Term

          1. Assert: _c_ is a Continuation. 1. Return ! RepeatMatcher(_m_, _min_, _max_, _greedy_, _x_, _c_, _parenIndex_, _parenCount_). -

          ----

          -

          In the above algorithm, references to Atom ::! `(` GroupSpecifier Disjunction `)` are to be interpreted as meaning Atom ::! `(` GroupSpecifier Disjunction `)` or Atom ::! `(` Disjunction `)` .

          The production Term ::! QuantifiableAssertion Quantifier evaluates the same as the production Term ::! Atom Quantifier but with |QuantifiableAssertion| substituted for |Atom|.
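The QuantifiableAssertion rule kept above is what lets a quantifier follow a lookahead outside Unicode mode. Here is a small sketch of the visible difference, assuming an engine that implements this legacy grammar; `compiles` is a hypothetical helper name.

```javascript
// Hypothetical helper: true if `source` is accepted as a pattern under `flags`.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// Without the `u` flag, a lookahead may be followed by a quantifier.
const legacyAccepted = compiles("(?=a)*a", "");

// In Unicode mode the same pattern is a SyntaxError.
const unicodeAccepted = compiles("(?=a)*a", "u");

// The quantified lookahead consumes no input, so the trailing `a` still has
// to match a character from the input.
const matched = new RegExp("(?=a)*a").test("a");

console.log(legacyAccepted, unicodeAccepted, matched); // true false true
```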

          From e21c032891c83e829a8bb0bab3699eba6ea7a7f4 Mon Sep 17 00:00:00 2001
          From: Michael Dyck
          Date: Wed, 21 Aug 2019 16:08:32 -0400
          Subject: [PATCH 11/12] Editorial: Merge PatternCharacter + ExtendedPatternCharacter

          ---
           spec.html | 18 ++++--------------
           1 file changed, 4 insertions(+), 14 deletions(-)

          diff --git a/spec.html b/spec.html
          index 51a0802a9e..085a5bde41 100644
          --- a/spec.html
          +++ b/spec.html
          @@ -34296,19 +34296,16 @@

          Patterns

             `(` GroupSpecifier[?UnicodeMode] Disjunction[?UnicodeMode, ?N] `)`
             `(` `?` `:` Disjunction[?UnicodeMode, ?N] `)`
             [~UnicodeMode] InvalidBracedQuantifier
          -  [+UnicodeMode] PatternCharacter
          -  [~UnicodeMode] ExtendedPatternCharacter
          +  PatternCharacter[?UnicodeMode]

           InvalidBracedQuantifier ::
             `{` DecimalDigits[~Sep] `}`
             `{` DecimalDigits[~Sep] `,` `}`
             `{` DecimalDigits[~Sep] `,` DecimalDigits[~Sep] `}`

          - ExtendedPatternCharacter ::
          -  SourceCharacter but not one of `^` `$` `\` `.` `*` `+` `?` `(` `)` `[` `|`
          -
          - PatternCharacter ::
          -  SourceCharacter but not SyntaxCharacter
          + PatternCharacter[U] ::
          +  [+U] SourceCharacter but not SyntaxCharacter
          +  [~U] SourceCharacter but not one of `^` `$` `\` `.` `*` `+` `?` `(` `)` `[` `|`

           SyntaxCharacter :: one of
             `^` `$` `\` `.` `*` `+` `?` `(` `)` `[` `]` `{` `}` `|`
          @@ -35400,12 +35397,6 @@

          Atom

          1. Return the Matcher that is the result of evaluating |Disjunction| with argument _direction_. -

          The production Atom ::! ExtendedPatternCharacter evaluates as follows:

          -
          - 1. Let _ch_ be the character represented by |ExtendedPatternCharacter|.
          - 1. Let _A_ be a one-element CharSet containing the character _ch_.
          - 1. Return ! CharacterSetMatcher(_A_, *false*, _direction_).
          -

          The production Atom ::! PatternCharacter evaluates as follows:

          1. Let _ch_ be the character matched by |PatternCharacter|. @@ -46618,7 +46609,6 @@

          Regular Expressions

          -
          From 42ca8e3be630118f0e335fb9967be9b8e90a921a Mon Sep 17 00:00:00 2001
          From: Michael Dyck
          Date: Wed, 21 Aug 2019 16:19:59 -0400
          Subject: [PATCH 12/12] Editorial: Add [N] parameter to 5 productions

          Specifically, add [N] parameter to
              CharacterClass
              ClassRanges
              NonemptyClassRanges
              NonemptyClassRangesNoDash
              ClassAtom

          These were implied when commit 95ec0c6 (of PR #1027)...
          - added [?N] to RHS occurrences of CharacterClass
            without explicitly adding [N] to the LHS occurrence CharacterClass; and
          - added [N] to the LHS occurrence of ClassAtomNoDash (in Annex B)
            without adding [?N] to any RHS occurrence.

          This commit propagates [N] across that gap.
          (See issue #1081.)
          ---
           spec.html | 38 +++++++++++++++++++-------------------
           1 file changed, 19 insertions(+), 19 deletions(-)

          diff --git a/spec.html b/spec.html
          index 085a5bde41..c578e778f8 100644
          --- a/spec.html
          +++ b/spec.html
          @@ -34292,7 +34292,7 @@

          Patterns

             `.`
             `\` AtomEscape[?UnicodeMode, ?N]
             [~UnicodeMode] `\` [lookahead == `c`]
          -  CharacterClass[?UnicodeMode]
          +  CharacterClass[?UnicodeMode, ?N]
             `(` GroupSpecifier[?UnicodeMode] Disjunction[?UnicodeMode, ?N] `)`
             `(` `?` `:` Disjunction[?UnicodeMode, ?N] `)`
             [~UnicodeMode] InvalidBracedQuantifier
          @@ -34303,9 +34303,9 @@

          Patterns

             `{` DecimalDigits[~Sep] `,` `}`
             `{` DecimalDigits[~Sep] `,` DecimalDigits[~Sep] `}`

          - PatternCharacter[U] ::
          -  [+U] SourceCharacter but not SyntaxCharacter
          -  [~U] SourceCharacter but not one of `^` `$` `\` `.` `*` `+` `?` `(` `)` `[` `|`
          + PatternCharacter[UnicodeMode] ::
          +  [+UnicodeMode] SourceCharacter but not SyntaxCharacter
          +  [~UnicodeMode] SourceCharacter but not one of `^` `$` `\` `.` `*` `+` `?` `(` `)` `[` `|`

           SyntaxCharacter :: one of
             `^` `$` `\` `.` `*` `+` `?` `(` `)` `[` `]` `{` `}` `|`
          @@ -34343,27 +34343,27 @@

          Group Specifiers

          Character Classes

          - CharacterClass[UnicodeMode] ::
          -  `[` [lookahead != `^`] ClassRanges[?UnicodeMode] `]`
          -  `[` `^` ClassRanges[?UnicodeMode] `]`
          + CharacterClass[UnicodeMode, N] ::
          +  `[` [lookahead != `^`] ClassRanges[?UnicodeMode, ?N] `]`
          +  `[` `^` ClassRanges[?UnicodeMode, ?N] `]`

          - ClassRanges[UnicodeMode] ::
          + ClassRanges[UnicodeMode, N] ::
             [empty]
          -  NonemptyClassRanges[?UnicodeMode]
          +  NonemptyClassRanges[?UnicodeMode, ?N]

          - NonemptyClassRanges[UnicodeMode] ::
          -  ClassAtom[?UnicodeMode]
          -  ClassAtom[?UnicodeMode] NonemptyClassRangesNoDash[?UnicodeMode]
          -  ClassAtom[?UnicodeMode] `-` ClassAtom[?UnicodeMode] ClassRanges[?UnicodeMode]
          + NonemptyClassRanges[UnicodeMode, N] ::
          +  ClassAtom[?UnicodeMode, ?N]
          +  ClassAtom[?UnicodeMode, ?N] NonemptyClassRangesNoDash[?UnicodeMode, ?N]
          +  ClassAtom[?UnicodeMode, ?N] `-` ClassAtom[?UnicodeMode, ?N] ClassRanges[?UnicodeMode, ?N]

          - NonemptyClassRangesNoDash[UnicodeMode] ::
          -  ClassAtom[?UnicodeMode]
          -  ClassAtomNoDash[?UnicodeMode] NonemptyClassRangesNoDash[?UnicodeMode]
          -  ClassAtomNoDash[?UnicodeMode] `-` ClassAtom[?UnicodeMode] ClassRanges[?UnicodeMode]
          + NonemptyClassRangesNoDash[UnicodeMode, N] ::
          +  ClassAtom[?UnicodeMode, ?N]
          +  ClassAtomNoDash[?UnicodeMode, ?N] NonemptyClassRangesNoDash[?UnicodeMode, ?N]
          +  ClassAtomNoDash[?UnicodeMode, ?N] `-` ClassAtom[?UnicodeMode, ?N] ClassRanges[?UnicodeMode, ?N]

          - ClassAtom[UnicodeMode] ::
          + ClassAtom[UnicodeMode, N] ::
             `-`
          -  ClassAtomNoDash[?UnicodeMode]
          +  ClassAtomNoDash[?UnicodeMode, ?N]

           ClassAtomNoDash[UnicodeMode, N] ::!
             SourceCharacter but not one of `\` or `]` or `-`