More syntax characters should be forbidden in ClassSyntaxCharacter #33
Comments
Cc stage 3 reviewers @waldemarhorwat @gibson042 @msaboff |
I think trying to have identical syntax inside and outside of character classes at this point muddies the water, and has the potential to cause people to give up on strings inside character classes. Just because some expression X (eg, There is a non-zero cognitive cost to requiring escapes on characters. |
I think Shane is lobbying for throwing a syntax error for what looks like match operators but is put inside a character class string, rather than have it be silently accepted as literal string contents. I myself am skeptical that regex authors would be confused here.
I agree. Looking for balance here, hoping for feedback from stage 3 reviewers. |
Right. From a cognitive burden point of view, I think having something silently accepted that has surprising behavior is far worse than a few extra escape characters. To be clear, the characters we're talking about escaping already need to be escaped outside a character class, and they won't appear in every regular expression, so I am skeptical that there is a measurable cost to require these escapes. |
They don't currently need to be escaped inside a character class. It seems like we have three choices (note, all of this is for inside a character class):
|
Yes, I think those are the three options. Let's put it this way. Consider these four "zones":
We currently have two sets of escaping rules: one applies to zones A and B, and the other applies to zones C and D. The main difference is that I am making the claim that it is surprising that the escaping rules are different between zones B and D, since they both look similar in the regular expression. So the minimal change would be to restrict zone D's escaping rules to be more like A and B (with the addition of However, for consistency's sake, the best route may be to unify all four of these zones to the same set of escape rules. This is Markus's Option 2. A sub-question of Option 2 would be whether we should also require |
Another difference is that |
I am skeptical of requiring characters that do not have special meaning inside character classes to be escaped there. This would break some commonly used idioms for no good reason: some folks like to write |
@waldemarhorwat Does your comment relate to both zones C and D or only zone C (from #33 (comment))? In other words, would you be okay with requiring escapes within ClassStrings, such as:
|
Why would you want to do that? |
See a few comments up: #33 (comment) |
I still don't see the rationale. Escape syntax currently depends only on whether one is inside or outside |
Let me just explain where Shane is coming from, since he is out for a bit and I want to make progress on our list of issues. I am not endorsing Shane's suggestion. Shane is looking at our use of We had chosen the round-parentheses-with-pipe syntax for string literals deliberately to make it look a little like an alternation, and most of us are not worried about regex authors confusing real alternations outside of character classes with the string literal syntax inside. |
I agree with @sffc about the risk here. Introducing structure inside a character class where none existed before will suggest to at least some practitioners that even more metacharacters have special meaning, and absence of syntax errors in such cases will let unintended regular expressions slip by—heck, just lack of sleep would probably be sufficient for me to misinterpret something like |
Alternation has nothing to do with parentheses; it's just the pipe. Instead of making parentheses special, you could add another modifier like
[|x|xy|xyz]
// instead of
[(x|xy|xyz)] |
The primary point of disagreement on this issue is the premise of the OP, that practitioners may experience unexpected behavior on |
We brainstormed about “researching” this topic in the TC39 Research Call last week. The main action item from it was “Team needs to agree on value, path we want to take.” We discussed it this morning in the team meeting.
|
I do not agree with the above post from Markus.
|
We don’t need additional research to know that
Given that, I would strongly prefer not spending time researching options other than those already on the table:
|
My concern is about standardizing a syntax that will have surprising behavior to practitioners reading and writing regular expressions. My concern applies to both of those syntaxes:
In other words, "reverting" to the curly brace syntax does not address my concern. My concerns are based on hypotheses. The reason data acquisition is appealing to me is that it would help validate or invalidate my hypotheses. |
Note that currently (without our proposal and not just in ECMAScript), both parentheses and curly braces have very different rules inside vs. outside of character classes. Outside, parentheses are used for grouping and (with the question mark) various other syntax escapes, and curly braces are used for quantifiers (a{3,5}) and for enclosing details of \u, \p, and \P (and elsewhere also \b{g} etc.). Inside character classes, they are currently all just literal characters. This means that practitioners have always had to be aware of very different syntax rules outside vs. inside of character classes. |
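The inside/outside difference described in this comment can be checked with today's regular expressions. The following is a minimal sketch in JavaScript that uses only long-standing syntax (no `v` flag and nothing from the proposal); the variable names are chosen here for illustration only.

```javascript
// Outside a character class, {3,5} is a quantifier and (...) is a group.
const quantifier = /^a{3,5}$/;
const group = /^(ab)$/;

// Inside a character class, the same characters are plain literals:
// each class below matches exactly one character from the listed set.
const bracesInClass = /^[a{3,5}]$/; // one of: a { 3 , 5 }
const parensInClass = /^[(ab)]$/;   // one of: ( a b )

console.log(quantifier.test("aaaa"));    // true
console.log(bracesInClass.test("aaaa")); // false
console.log(bracesInClass.test("{"));    // true
console.log(group.test("ab"));           // true
console.log(parensInClass.test("ab"));   // false
console.log(parensInClass.test("("));    // true
```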
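PS: We know that several people really don't want to require more escaping than we need. I have a preference for consistent escaping inside of character classes. But I could live with more escaping inside [(string|literal|syntax)] than in the rest of a character class. Also, we have settled before on the string literal syntax, but I could live with the more verbose [\q{string|literal|syntax}], with the \q prefix, as suggested in UTS 18. If we did go back to this one, then I think we should not need the additional escaping. |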
Re #17 (comment): I claim that it is misleading for curly braces to not have a specifier
character preceding them.
Looking at: "I'm not a fan of bare {} because when I read {}, I expect to
find a modifier that tells me what the {} means. If there is a regular
expression like /[p{letter}]/v, it looks like I am looking up a Unicode
property "letter", but actually I am just matching the alternation "p" and
"letter"."
People using regex seem to have very little problem with realizing that
characters inside a CC and outside a CC are different: that in [a*] the *
is a literal, and [a]* it is not. And expecting [p{letter}] to work exactly
like [\p{letter}] would be a user error.
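The two contrasts in the paragraph above can be tried directly. This is a small sketch of current behavior under the existing `u` flag, not of the proposed bare-brace string syntax. One assumption to note: ECMAScript spells the General_Category value `Letter` (or `L`), so the sketch uses `\p{Letter}` rather than the lowercase spelling in the text.

```javascript
// In [a*] the * is a literal member of the class.
const starInClass = /^[a*]$/;
// In [a]* the * is a quantifier applied to the class.
const starAfterClass = /^[a]*$/;

// Without the backslash, [p{letter}] is just a set of single characters:
// p { l e t r }
const bareBraces = /^[p{letter}]$/u;
// With the backslash it is a Unicode property escape.
const propertyEscape = /^[\p{Letter}]$/u;

console.log(starInClass.test("*"));          // true
console.log(starAfterClass.test("*"));       // false
console.log(starAfterClass.test("aaa"));     // true
console.log(bareBraces.test("{"));           // true
console.log(bareBraces.test("\u00e9"));      // false
console.log(propertyEscape.test("\u00e9"));  // true
console.log(propertyEscape.test("{"));       // false
```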
Given that people don't like excessive escaping, I think the choices at
this point are clear:
a. stick with (string|literal|syntax)
b. revert to {string|literal|syntax}
c. revert further to \q{string|literal|syntax}
I could live with any of these, but prefer (a) since that is what we had
settled on.
Mark
|
Thank you for this comment, which presents a counter-argument for my hypothesis.
Yes, a user error. I firmly believe that an important part of our job as spec authors is to design a syntax that is resistant to user errors.
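Let's keep all options on the table so that we know clearly what we are working with.
a. stick with (string|literal|syntax)
b. revert to {string|literal|syntax}
c. revert further to \q{string|literal|syntax}
d. amend (string|literal|syntax) with more escape rules (multiple ways to do that)
My hypothesis is that (a) will cause confusion to practitioners. @markusicu has offered a counter-argument.
My hypothesis is that (b) will also cause confusion to practitioners. @macchiati agrees. I feel more strongly about this hypothesis than the previous one.
I do not perceive substantial risk for (c) causing confusion to practitioners, but it comes with a (fairly small) ergonomics cost.
My hypothesis remains that (d) offers the best balance between ergonomics and understandability.
So I really see two paths forward:
1. Agree as a champions group that the premise of this issue, the hypothesis that [(a*|b*|c*)] will cause confusion for practitioners, is false, and close the issue.
2. Agree as a champions group that the premise of the issue is true, and then choose one of the other choices that we have on the table. It seems that (c) is the most likely fallback option.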
The reason I suggested bringing this to the TC39 Research Call was to validate the premise of this issue. I do not believe strongly enough in my hypothesis to suggest that we revert to option (c) without seeing additional data. |
My hypothesis is that (b) will also cause confusion to practitioners.
@macchiati agrees. I feel more strongly
about this hypothesis than the previous one.
That isn't what I agreed with (sounds like I wasn't clear).
Strictly in terms of clarity, I think
1. \q{...} is best (but slightly more awkward)
2. {...} is somewhat worse than \q
3. (...) is somewhat worse than {...}
but I could live with any of them.
I don't think that escaping is required for any of them — except of course
that
1. requires | and } be escaped inside, that is, after \q{
2. requires { be escaped outside, and | and } inside
3. requires ( be escaped outside, and | and ) inside
and \ itself, of course.
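For the \q{...} form (numbered 1 in the lists above), the escaping rule can be sketched as follows. This assumes an engine that implements the `v` flag with `\q{...}` string literals, which postdates this discussion. `tryCompile` is a hypothetical helper that returns null where the syntax is unsupported, so the sketch degrades instead of throwing.

```javascript
// Hypothetical helper: compile a pattern, or return null if the engine
// rejects the pattern or the flags.
function tryCompile(source, flags) {
  try {
    return new RegExp(source, flags);
  } catch (err) {
    return null;
  }
}

// \q{abc|d}: a class containing the strings "abc" and "d".
const strings = tryCompile("^[\\q{abc|d}]$", "v");
// Inside \q{...}, an escaped pipe is a literal "|" rather than a separator.
const escapedPipe = tryCompile("^[\\q{a\\|b}]$", "v");

if (strings && escapedPipe) {
  console.log(strings.test("abc"));     // true where the v flag is supported
  console.log(strings.test("ab"));      // false
  console.log(escapedPipe.test("a|b")); // true
  console.log(escapedPipe.test("a"));   // false
} else {
  console.log("this engine does not support the v flag");
}
```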
I only hear one person so far saying that "[(a*|b*|c*)] will cause
confusion for practitioners"
Mark
|
Sorry for misunderstanding your position.
That isn't precisely what I'm saying, but I can count 3 people who have raised the concern:
|
Also, it is misrepresenting my position to say that I believe that "[(a*|b*|c*)] will cause confusion for practitioners". I am raising the hypothesis that it might cause confusion, a hypothesis which is based on anecdotal evidence. I am more than happy to debate the merits of the hypothesis. |
I would really like to unblock progress on this. Since |
That is true. However,
This is slightly incorrect. Curly braces are kind of supervillain characters — they have 3 different meanings outside character class (literal Parentheses, on the other hand, were always literal inside character class. |
I'm ok with that.
|
Looks like we might be converging on
Note
In the draft spec changes so far, inside character classes, we require escaping If we go back to
Follow-up questions
Should we keep requiring escaping We could keep requiring escaping now, and we could stop requiring escaping later if practitioners complain. (“Old” If we didn't require escaping |
The status quo is |
We have heard a number of voices that the status quo is not acceptable, if
it includes requiring escapes for characters like *. Going back to \q
addresses that issue.
|
On the escaping if we go back to \q{...}, I suggest that:
|
I prefer the status quo of I don't want to add more contexts where |
I am one of the original advocates for In other words, if the hypothesis is invalid, I would prefer sticking with I'm dissatisfied with the dismissal of |
Looks like we are still struggling to settle this via comments. Meeting tomorrow.
Extra escapes
Sounds like we won't require escaping more characters like
String literal syntax
Shane advocated for Looks like everyone can live with Mark and Mathias most recently lobbied for
Consistent escaping and future extensions
We need to decide, for inside character classes,
|
Meeting today with Richard, Mathias, Mark, and myself:
|
I would like to resolve #46 first. If our vision is for sets of strings to become more expressive, then we should use So, the following options are all okay with me:
The following options are not okay with me without further research:
|
I don't see how one informs the other very much. String literals with wildcards could be done either way. If anything, the ideas for wildcards are likely to end up with string literals being yet more different from expressions outside of character classes (some stuff similar, but much different), so probably actually better not to use
I think that this could easily take a couple of months of tossing around ideas for syntax and semantics of wildcards. I don't want to delay our proposal by that much.
I was skeptical about requiring more escapes based on the hypothesis that practitioners might be confused. |
I favor at this point not expanding the scope, and instead: 3.1. Declare issue 46 out of scope for now, and use \q{...}. If and when we ever want to do something along the lines of #46, we can handle it by having a new introducer for strings with fancy syntax: \δ{...}, where δ is a suitable available ASCII letter. |
This is my preference as well. |
I continue to believe that |
The discussion here and in 46 is pushing me away from |
Discussed today with Markus, Mathias, Richard, Mark, Bradley, Shane. |
/(a*)/ matches strings with zero or more "a". But currently /[(a*)]/v matches the literal string "a*".
I think we should try to be consistent where possible on the matching behavior of alternations () outside of character classes and sets of strings [()] (ClassStrings) inside of character classes, because wrapping a string alternation with [] should not cause the matching behavior to change in surprising ways. Concretely, I would like us to require escaping of all SyntaxCharacter in ClassSyntaxCharacter or at least in NonEmptyClassString.
Summarizing the position of other champions based on our discussion:
@mathiasbynens has pointed out that the behavior of alternations and ClassStrings already differs in the sense that alternations create capturing groups, but ClassStrings do not.
@macchiati has pushed back on requiring more syntax characters to be escaped.
@markusicu has advocated for keeping the definition of ClassSyntaxCharacter consistent both inside and outside of ClassStrings within the context of a character class. He points out that syntax characters like *, +, ?, etc., are interpreted as literals in character classes already.
[\*(\*)] instead of [*(*)]), but I would be okay with only requiring the escape in ClassStrings ([*(\*)]).
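Two points from this post can be checked without the proposal: that (a*) outside a class is a quantified capturing group, and that escaping syntax characters inside a class is already accepted. The sketch below uses a legacy character class as a stand-in and does not exercise the proposed /[(a*)]/v string syntax.

```javascript
// Outside a class, (a*) is a capturing group around a quantified "a".
const outside = /^(a*)$/.exec("aaa");
console.log(outside[1]); // "aaa"

// Inside a legacy class, ( a * ) are four literal members and nothing
// is captured: the result holds only the overall match.
const inside = /^[(a*)]$/.exec("*");
console.log(inside.length); // 1

// Escaping the syntax characters inside a class is already valid,
// even in the stricter `u` mode, and matches the same characters.
const escaped = /^[\*\(\)]$/u;
console.log(escaped.test("*")); // true
console.log(escaped.test("(")); // true
console.log(escaped.test("a")); // false
```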