This repository has been archived by the owner on Feb 16, 2024. It is now read-only.

Add syntax for string literals #17

Closed
mathiasbynens opened this issue Apr 8, 2021 · 16 comments

@mathiasbynens
Member

During the April 2021 TC39 Incubator call, we received feedback that the addition of string literal syntax is necessary, to enable things like “match all RGI_Emoji except for the Belgian flag”.

We tried to leave it out of the proposal before in an attempt to keep it minimal, but now it’s become clear that it’s crucial functionality that should be included from the start.

What should the syntax be? UTS18 has \q{…}: https://unicode.org/reports/tr18/#Character_Ranges_with_Strings But we also considered not having a prefix at all, and instead just using {…}. Thoughts? Opinions?

@markusicu
Collaborator

markusicu commented Apr 8, 2021

ICU UnicodeSet pattern strings have long supported multi-character strings via curly braces. For example:
[a-zA-Z{ch}{m̀}{か゚}{🇦🇺}{🇧🇪}{🇫🇷}]

CLDR uses this syntax in its data files.

UTS #18 suggests \q{...} as a backwards-compatible addition to existing syntax. The additional prefix makes it more clunky:
[a-zA-Z\q{ch}\q{m̀}\q{か゚}\q{🇦🇺}\q{🇧🇪}\q{🇫🇷}]

Note that when you have multiple strings, even just the curly braces without prefix get hard to read.

In our discussions, we came up with using curly braces, but also adding the pipe symbol | as an internal separator, allowing multiple strings in one set of braces:
[a-zA-Z{ch|m̀|か゚|🇦🇺|🇧🇪|🇫🇷}]

I really like this improved syntax.

@macchiati
Collaborator

macchiati commented Apr 8, 2021 via email

@mathiasbynens
Member Author

Two random thoughts:

  1. I’m assuming we want to allow string values containing the | character, by escaping it as \|. Agree/disagree?

  2. Do we want {x} (where x is a single code point) to be allowed or disallowed?

    • Argument in favor of allowing: that way, {…} can be used for any non-empty string.
    • Argument in favor of disallowing: {x} is just a different way of writing x, so we don’t need to allow it.
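The escaping question in point 1 already has a counterpart in ordinary alternations outside character classes, where `\|` matches a literal pipe. A minimal sketch of that existing behavior follows; `escapeLiteral` and `alternationOf` are hypothetical helper names used only for illustration, not part of any proposal.

```javascript
// Hypothetical helper: escape regex syntax characters so the text matches literally.
function escapeLiteral(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build an anchored alternation from a list of strings.
function alternationOf(strings) {
  return new RegExp('^(?:' + strings.map(escapeLiteral).join('|') + ')$', 'u');
}

const re = alternationOf(['a|b', 'c']);
console.log(re.source);      // ^(?:a\|b|c)$
console.log(re.test('a|b')); // true: the escaped pipe is part of the string
console.log(re.test('a'));   // false: "a" alone is not one of the strings
console.log(re.test('c'));   // true
```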

@sffc
Collaborator

sffc commented Apr 8, 2021

I've expressed this before, but I have a mild to medium preference for either \q{abc} or (abc).

[a-zA-Z\q{ch}\q{m̀}\q{か゚}\q{🇦🇺}\q{🇧🇪}\q{🇫🇷}]
[a-zA-Z(ch|m̀|か゚|🇦🇺|🇧🇪|🇫🇷)]

I'm not a fan of bare {} because when I read {}, I expect to find a modifier that tells me what the {} means. If there is a regular expression like /[p{letter}]/v, it looks like I am looking up a Unicode property "letter", but actually I am just matching the alternation "p" and "letter".

I am okay with () because that syntax is already widely understood as being an alternation of strings.
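To make the readability concern concrete, here is a small sketch of how the two look-alike patterns behave today under the existing `u` flag, where braces inside a character class are plain characters. It says nothing definitive about `v`-flag behavior; it only shows how little visually separates the two forms.

```javascript
// With the `u` flag, braces inside [...] are ordinary class members,
// so this is just the set of characters p, {, l, e, t, r, }.
const looksLikeProperty = /^[p{letter}]$/u;

// The backslash turns it into a real Unicode property escape.
const realProperty = /^\p{Letter}$/u;

console.log(looksLikeProperty.test('{'));      // true
console.log(looksLikeProperty.test('\u00E9')); // false: é is not in the set
console.log(realProperty.test('\u00E9'));      // true: é is a letter
console.log(realProperty.test('{'));           // false
```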

@macchiati
Collaborator

macchiati commented Apr 8, 2021 via email

@macchiati
Collaborator

macchiati commented Apr 8, 2021 via email

@sffc
Collaborator

sffc commented Apr 8, 2021

> In other words, people expect {} to be syntax characters, and are likely to pause and consider what they would mean. They are less likely to do that with ().

That's just not true. In ECMAScript Regex, () are already syntax characters, and always have been. They already do exactly what you are proposing {} to do: an alternation of strings.

> they are quite uncommon, whereas () much more likely to be needed as literals.

Do you have evidence for that claim? I don't think it's any more or less likely to write a regular expression that tries to match {} than one that tries to match (). Both sets of characters should be reserved syntax.

@markusicu
Collaborator

> 1. I’m assuming we want to allow string values containing the | character, by escaping it as \|. Agree/disagree?

Yes, absolutely. Character escapes should work inside {}. Property escapes make no sense here, of course.

\N{name} would make sense if ES wants to support it, and would make sense both for character (single code point) names and NamedSequences.

> 2. Do we want {x} (where x is a single code point) to be allowed or disallowed?
>
>     • Argument in favor of allowing: that way, {…} can be used for any non-empty string.

Yes, absolutely.

And if ES regex allows an empty string in an alternation -- ...(|a|ab)... -- then this syntax should also allow an empty string: [...{|a|ab}...]
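For reference, ECMAScript regular expressions do accept an empty alternative outside character classes, as this short sketch shows. It uses a non-capturing group and anchors purely to keep the checks simple.

```javascript
// An alternation whose first alternative is the empty string.
const withEmpty = /^(?:|a|ab)$/u;

console.log(withEmpty.test(''));   // true: the empty alternative matches
console.log(withEmpty.test('a'));  // true
console.log(withEmpty.test('ab')); // true
console.log(withEmpty.test('b'));  // false
```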

@sffc
Collaborator

sffc commented Apr 8, 2021

Another argument for using (): suppose I have the valid regex

/(ch|m̀|か゚|🇦🇺|🇧🇪|🇫🇷)/u

Now I decide that I want to perform a set operation on the alternation. With (), I can simply wrap it in [], like this:

/[(ch|m̀|か゚|🇦🇺|🇧🇪|🇫🇷)&&\p{Emoji}]/v

With any of the alternatives proposed, I would need to re-write my string alternation to some other syntax that works inside []. Why not use the same syntax inside [] as outside?

@markusicu
Collaborator

As for braces vs. parentheses: AFAIK the only comparable existing practice is in ICU/CLDR (just braces) and in UTS #18 (braces with \q prefix), and I am not aware of any implementation of this part of UTS #18. As such, at least some developers are already familiar with braces for strings.

In any case I want to lobby for multiple strings with a separator. If we were to go with the prefix, that would be
[a-zA-Z\q{ch|m̀|か゚|🇦🇺|🇧🇪|🇫🇷}]

@mathiasbynens
Member Author

We settled on using parentheses, since it most closely matches the existing syntax for alternating between strings outside of character classes. One difference is that (…) creates a capturing group outside of character classes but not within character classes, but I think that’s clear from the context.

@markusicu
Collaborator

We have advanced to stage 2 with string literals like (ch|m̀|か゚|🇦🇺|🇧🇪|🇫🇷).

@lightmare

> We settled on using parentheses, since it most closely matches the existing syntax for alternating between strings outside of character classes. One difference is that (…) creates a capturing group outside of character classes but not within character classes, but I think that’s clear from the context.

Another being that it matches left-to-right outside, longest-to-shortest inside.
I think it's a terrible choice, precisely because it looks identical but acts differently.

Let's look at some syntax precedent.
Currently in JS, \p{Letter} means the same thing outside a character class, as inside [\p{Letter}].
In this proposal, nesting allows [a-z] to mean the same thing outside a character class, as inside [[a-z]].

If you want to have string literal syntax that looks the same outside and inside, make it behave the same outside and inside.

For that I propose modifier | that will turn the inside of the brackets into this longest-first alternation, just like ^ turns a character class into its complement. This would make only 1 additional character inside brackets special, rather than 3.

[|a|aab|aaac] would behave like (?:aaac|aab|a)

And of course you could use this [|...] construct stand-alone, or nested / in set operations.
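The difference between the two matching orders can be sketched with today's syntax. In the snippet below, `escapeLiteral` and `longestFirst` are hypothetical helpers, and sorting by code point count is an assumption about what "longest" means here; the point is only that a plain alternation takes the first alternative that matches, while a longest-first rewrite does not.

```javascript
// Hypothetical helper: escape regex syntax characters so the text matches literally.
function escapeLiteral(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: emulate longest-first matching by sorting the strings
// by descending code point count before joining them into an alternation.
function longestFirst(strings) {
  const sorted = [...strings].sort(
    (x, y) => Array.from(y).length - Array.from(x).length
  );
  return new RegExp('(?:' + sorted.map(escapeLiteral).join('|') + ')', 'u');
}

const leftToRight = /(?:a|aab|aaac)/u;
const longest = longestFirst(['a', 'aab', 'aaac']);

console.log(leftToRight.exec('aab')[0]); // "a": the first alternative wins
console.log(longest.source);             // (?:aaac|aab|a)
console.log(longest.exec('aab')[0]);     // "aab"
```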

@markusicu
Collaborator

[|a|aab|aaac] looks interesting, but since the brackets suggest a nested set, it immediately raises the question of whether you can put other stuff there, such as properties; and if it's limited to just string literals, it might be just as confusing as parentheses.

@lightmare

> ... and if it's limited to just string literals, it might be just as confusing as parentheses

Quite the opposite. If I read the temporary grammar correctly, ClassStrings limit what can appear inside [(...)]. That's the third difference from (...) outside a character class.
Whether other stuff can appear inside [|...] is a separate debate, my point was to make the "longest-first-alternation" construct consistent regardless of whether it's nested or not.

@markusicu
Collaborator

Update: After much discussion on issues #33 and #46 we have decided to go back to the UTS #18 string literal syntax like \q{string|literal|syntax}. It signals to practitioners that if they don't know what it is they need to look it up, and it leaves the door open for string-range/abbreviation syntax using either parentheses or a different backslash-with-new-letter combination.
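As a rough illustration of the `\q{…}` syntax, the sketch below assumes an engine that implements the `v` flag as later standardized in ECMAScript 2024. `tryCompile` is a hypothetical helper that returns `null` on older engines, so the snippet degrades gracefully instead of throwing. The second pattern is the "all RGI_Emoji except the Belgian flag" case from the opening comment.

```javascript
// Hypothetical helper: compile a pattern, or return null when the engine
// rejects it (for example, because it predates the `v` flag).
function tryCompile(source, flags) {
  try {
    return new RegExp(source, flags);
  } catch (error) {
    return null;
  }
}

// A set containing a-z plus the multi-character strings "ch" and "ll".
const digraphs = tryCompile('^[a-z\\q{ch|ll}]$', 'v');

// All RGI emoji minus the Belgian flag (U+1F1E7 U+1F1EA).
const notBelgium = tryCompile(
  '^[\\p{RGI_Emoji}--\\q{\u{1F1E7}\u{1F1EA}}]$',
  'v'
);

if (digraphs && notBelgium) {
  console.log(digraphs.test('ch'));   // true: "ch" is a string in the set
  console.log(digraphs.test('c'));    // true: "c" is in the range a-z
  console.log(digraphs.test('cl'));   // false: neither a member character nor a listed string
  console.log(notBelgium.test('\u{1F1EB}\u{1F1F7}')); // true: French flag
  console.log(notBelgium.test('\u{1F1E7}\u{1F1EA}')); // false: Belgian flag was subtracted
} else {
  console.log('This engine does not support the v flag.');
}
```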


5 participants