Add syntax for string literals #17

mathiasbynens · 2021-04-08T17:04:16Z

During the April 2021 TC39 Incubator call, we received feedback that the addition of string literal syntax is necessary, to enable things like “match all RGI_Emoji except for the Belgian flag”.

We tried to leave it out of the proposal before in an attempt to keep it minimal, but now it’s become clear that it’s crucial functionality that should be included from the start.

What should the syntax be? UTS #18 has \q{…} (https://unicode.org/reports/tr18/#Character_Ranges_with_Strings). But we also considered not having a prefix at all, and instead just using {…}. Thoughts? Opinions?

markusicu · 2021-04-08T17:30:08Z

ICU UnicodeSet pattern strings have long supported multi-character strings via curly braces. For example:
[a-zA-Z{ch}{m̀}{か゚}{🇦🇺}{🇧🇪}{🇫🇷}]

CLDR uses this syntax in its data files.

UTS #18 suggests \q{...} as a backwards-compatible addition to existing syntax. The additional prefix makes it more clunky:
[a-zA-Z\q{ch}\q{m̀}\q{か゚}\q{🇦🇺}\q{🇧🇪}\q{🇫🇷}]

Note that when you have multiple strings, even just the curly braces without prefix get hard to read.

In our discussions, we came up with using curly braces, but also adding the pipe symbol | as an internal separator, allowing multiple strings in one set of braces:
[a-zA-Z{ch|m̀|か゚|🇦🇺|🇧🇪|🇫🇷}]

I really like this improved syntax.

macchiati · 2021-04-08T17:51:44Z

I do as well, far easier to read. The working group on UTS #18 Unicode Regular Expressions is also recommending this for the next version.

mathiasbynens · 2021-04-08T20:41:39Z

Two random thoughts:

1. I’m assuming we want to allow string values containing the | character, by escaping it as \|. Agree/disagree?
2. Do we want {x} (where x is a single code point) to be allowed or disallowed?
   - Argument in favor of allowing: that way, {…} can be used for any non-empty string.
   - Argument in favor of disallowing: {x} is just a different way of writing x, so we don’t need to allow it.

sffc · 2021-04-08T20:46:32Z

I've expressed this before, but I have a mild to medium preference for either \q{abc} or (abc).

[a-zA-Z\q{ch}\q{m̀}\q{か゚}\q{🇦🇺}\q{🇧🇪}\q{🇫🇷}]
[a-zA-Z(ch|m̀|か゚|🇦🇺|🇧🇪|🇫🇷)]

I'm not a fan of bare {} because when I read {}, I expect to find a modifier that tells me what the {} means. If there is a regular expression like /[p{letter}]/v, it looks like I am looking up a Unicode property "letter", but actually I am just matching the alternation "p" and "letter".
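
To make the look-alike concern concrete, here is a small sketch that runs in engines supporting the existing `u` flag. It does not use any of the proposed syntax: under today's semantics, the braces in `[p{letter}]` are plain literal characters, while `[\p{Letter}]` is a Unicode property lookup. The variable names are only illustrative.

```javascript
// Today's `u`-mode semantics, not the proposed bare-brace syntax:
// without the backslash, the class is a set of the literal characters
// p, {, l, e, t, r, and }.
const lookalike = /^[p{letter}]$/u;

// With the backslash, the class matches any code point with the
// Unicode General_Category "Letter".
const property = /^[\p{Letter}]$/u;

console.log(lookalike.test('{')); // true: "{" is a literal member
console.log(lookalike.test('é')); // false: "é" is not in the literal set
console.log(property.test('{'));  // false: "{" is not a letter
console.log(property.test('é'));  // true: "é" is a letter
```

The two patterns differ by a single backslash yet match very different sets, which is the kind of misreading the comment above is worried about.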

I am okay with () because that syntax is already widely understood as being an alternation of strings.

macchiati · 2021-04-08T20:49:07Z

Yes, inside {} the literal characters |, \, and } require quoting. I strongly favor {x} working. For one thing, then the user doesn't have to look up to know that {x̣} requires {} but that {ẍ} doesn't require them. When in doubt, the user can always use {}.
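
The quoting rule can be sketched as a small splitter for the text between the braces. This is only an illustration under the assumption that a backslash makes the following character literal; `splitAlternatives` is a hypothetical helper, not part of ICU, UTS #18, or the proposal.

```javascript
// Hypothetical helper: split the body of a {…} string literal on
// unescaped "|" separators. A backslash makes the next character
// literal, so \|, \\ and \} can appear inside a string.
function splitAlternatives(body) {
  const parts = [];
  let current = '';
  for (let i = 0; i < body.length; i++) {
    const ch = body[i];
    if (ch === '\\' && i + 1 < body.length) {
      current += body[i + 1]; // keep the escaped character as-is
      i++;                    // and skip over it
    } else if (ch === '|') {
      parts.push(current);    // unescaped separator ends a string
      current = '';
    } else {
      current += ch;
    }
  }
  parts.push(current);
  return parts;
}

console.log(splitAlternatives('ch|a\\|b|x\\}')); // [ 'ch', 'a|b', 'x}' ]
console.log(splitAlternatives('x'));             // [ 'x' ]
```

With this reading, `{x}` is simply the one-string case, so a user who is unsure whether a character is a single code point can always wrap it in braces.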

macchiati · 2021-04-08T20:55:21Z

Reasons for using {} for the syntax characters inside of brackets:
1. They are quite uncommon, whereas () are much more likely to be needed as literals.
2. "because when I read {}, I expect to find a modifier that tells me what the {} means." In other words, people expect {} to be syntax characters, and are likely to pause and consider what they would mean. They are less likely to do that with ().

sffc · 2021-04-08T21:03:28Z

> In other words, people expect {} to be syntax characters, and are likely to pause and consider what they would mean. They are less likely to do that with ().

That's just not true. In ECMAScript Regex, () are already syntax characters, and always have been. They already do exactly what you are proposing {} to do: an alternation of strings.

> they are quite uncommon, whereas () much more likely to be needed as literals.

Do you have evidence for that claim? I don't think it's any more or less likely to write a regular expression that tries to match {} than one that tries to match (). Both sets of characters should be reserved syntax.

markusicu · 2021-04-08T21:07:03Z

> I’m assuming we want to allow string values containing the | character, by escaping it as \|. Agree/disagree?

Yes, absolutely. Character escapes should work inside {}. Property escapes make no sense here, of course.

\N{name} would make sense if ES wants to support it, and would make sense both for character (single code point) names and NamedSequences.

> Do we want {x} (where x is a single code point) to be allowed or disallowed?
>
> Argument in favor of allowing: that way, {…} can be used for any non-empty string.

Yes, absolutely.

And if ES regex allows an empty string in an alternation -- ...(|a|ab)... -- then this syntax should also allow an empty string: [...{|a|ab}...]

sffc · 2021-04-08T21:12:04Z

Another argument for using (): suppose I have the valid regex

/(ch|m̀|か゚|🇦🇺|🇧🇪|🇫🇷)/u

Now I decide that I want to perform a set operation on the alternation. With (), I can simply wrap it in [], like this:

/[(ch|m̀|か゚|🇦🇺|🇧🇪|🇫🇷)&&\p{Emoji}]/v

With any of the alternatives proposed, I would need to re-write my string alternation to some other syntax that works inside []. Why not use the same syntax inside [] as outside?

markusicu · 2021-04-08T21:12:08Z

As for braces vs. parentheses: AFAIK the only comparable existing practice is in ICU/CLDR (just braces) and in UTS #18 (braces with \q prefix), and I am not aware of any implementation of this part of UTS #18. As such, at least some developers are already familiar with braces for strings.

In any case I want to lobby for multiple strings with a separator. If we were to go with the prefix, that would be
[a-zA-Z\q{ch|m̀|か゚|🇦🇺|🇧🇪|🇫🇷}]

mathiasbynens · 2021-04-23T10:30:11Z

We settled on using parentheses, since it most closely matches the existing syntax for alternating between strings outside of character classes. One difference is that (…) creates a capturing group outside of character classes but not within character classes, but I think that’s clear from the context.

markusicu · 2021-05-27T18:33:30Z

We have advanced to stage 2 with string literals like (ch|m̀|か゚|🇦🇺|🇧🇪|🇫🇷).

lightmare · 2021-08-09T22:23:36Z

> We settled on using parentheses, since it most closely matches the existing syntax for alternating between strings outside of character classes. One difference is that (…) creates a capturing group outside of character classes but not within character classes, but I think that’s clear from the context.

Another difference is that alternatives are tried left-to-right outside a character class, but longest-to-shortest inside.
I think it's a terrible choice, precisely because it looks identical but acts differently.

Let's look at some syntax precedent.
Currently in JS, \p{Letter} means the same thing outside a character class, as inside [\p{Letter}].
In this proposal, nesting allows [a-z] to mean the same thing outside a character class, as inside [[a-z]].

If you want to have string literal syntax that looks the same outside and inside, make it behave the same outside and inside.

For that I propose modifier | that will turn the inside of the brackets into this longest-first alternation, just like ^ turns a character class into its complement. This would make only 1 additional character inside brackets special, rather than 3.

[|a|aab|aaac] would behave like (?:aaac|aab|a)
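
The difference between the two matching orders can be seen with ordinary alternation today. The sketch below assumes "longest-first" means sorting the strings by descending length before joining them; `escapeForRegExp` and `longestFirst` are hypothetical helpers written for this illustration, not part of the proposal.

```javascript
// Escape regex syntax characters so each string is matched literally.
function escapeForRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build a non-capturing alternation that tries
// longer strings before shorter ones (length measured in code units,
// which is a simplification).
function longestFirst(strings) {
  const sorted = [...strings].sort((a, b) => b.length - a.length);
  return new RegExp('(?:' + sorted.map(escapeForRegExp).join('|') + ')', 'u');
}

const leftToRight = /(?:a|aab|aaac)/u;
const longest = longestFirst(['a', 'aab', 'aaac']);

console.log(longest.source);              // (?:aaac|aab|a)
console.log(leftToRight.exec('aab')[0]);  // a
console.log(longest.exec('aab')[0]);      // aab
```

The same three strings produce different matches depending only on the order in which the alternatives are tried, which is why identical-looking syntax with different ordering inside and outside a class could surprise readers.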

And of course you could use this [|...] construct stand-alone, or nested / in set operations.

markusicu · 2021-08-09T22:46:22Z

[|a|aab|aaac] looks interesting, but since the brackets suggest a nested set, it immediately raises the question of whether you can put other stuff there, such as properties; and if it's limited to just string literals, it might be just as confusing as parentheses.

lightmare · 2021-08-10T12:53:14Z

... and if it's limited to just string literals, it might be just as confusing as parentheses

Quite the opposite. If I read the temporary grammar correctly, ClassStrings limit what can appear inside [(...)]. That's the third difference from (...) outside a character class.
Whether other stuff can appear inside [|...] is a separate debate, my point was to make the "longest-first-alternation" construct consistent regardless of whether it's nested or not.

markusicu · 2021-09-30T22:40:06Z

Update: After much discussion on issues #33 and #46 we have decided to go back to the UTS #18 string literal syntax like \q{string|literal|syntax}. It signals to practitioners that if they don't know what it is they need to look it up, and it leaves the door open for string-range/abbreviation syntax using either parentheses or a different backslash-with-new-letter combination.
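
As a sketch of how the \q{…} syntax addresses the original use case ("match all RGI_Emoji except for the Belgian flag"), the following assumes an engine that implements the `v` flag, the `RGI_Emoji` property of strings, and the `--` set difference operator. The pattern is built with `new RegExp` inside a `try` block so the snippet still runs, and returns `null`, in engines without that support. `buildEmojiExceptBelgium` is a hypothetical name.

```javascript
// Assumes `v`-flag support; returns null where the flag is unavailable.
function buildEmojiExceptBelgium() {
  try {
    // Set difference: every RGI emoji minus the string literal for the
    // Belgian flag (a two-code-point sequence).
    return new RegExp('^[\\p{RGI_Emoji}--\\q{🇧🇪}]$', 'v');
  } catch (error) {
    return null; // this engine does not support the `v` flag
  }
}

const emojiExceptBelgium = buildEmojiExceptBelgium();
if (emojiExceptBelgium === null) {
  console.log('The v flag is not supported in this engine.');
} else {
  console.log(emojiExceptBelgium.test('🇧🇪')); // false
  console.log(emojiExceptBelgium.test('🇫🇷')); // true
}
```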

markusicu closed this as completed May 27, 2021

sffc mentioned this issue Sep 20, 2021

More syntax characters should be forbidden in ClassSyntaxCharacter #33

Closed
