Add syntax for string literals #17
Comments
ICU UnicodeSet pattern strings have long supported multi-character strings via curly braces, and CLDR uses this syntax in its data files. UTS #18 suggests the prefixed form \q{…}. Note that when you have multiple strings, even just the curly braces without prefix get hard to read. In our discussions, we came up with using curly braces, but also adding the pipe symbol | as a separator between strings. I really like this improved syntax.
I do as well, far easier to read. The working group on UTS #18 Unicode Regular Expressions is also recommending this for the next version.
|
Two random thoughts:
|
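I've expressed this before, but I have a mild to medium preference for either \q{abc} or (abc).
I'm not a fan of bare {} because when I read {}, I expect to find a modifier that tells me what the {} means. I am okay with () because that syntax is already widely understood as being an alternation of strings.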
Yes, inside {} the literal characters |, \, and } require quoting.
I strongly favor {x} working. For one thing, then the user doesn't have to
look up to know that {x̣} requires {} but that {ẍ} doesn't require them.
When in doubt, the user can always use {}.
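To illustrate the "doesn't have to look up" point: whether a user-perceived character is one code point or several depends on whether Unicode happens to define a precomposed form for it. A minimal sketch, assuming NFC-normalized input and using only the standard String.prototype.normalize; the helper name codePointCount is hypothetical.

```javascript
// Count code points (not UTF-16 code units) after NFC normalization.
// codePointCount is a hypothetical helper, for illustration only.
function codePointCount(s) {
  return [...s.normalize('NFC')].length;
}

const xDotBelow = 'x\u0323';  // x + COMBINING DOT BELOW: no precomposed form exists
const xDiaeresis = 'x\u0308'; // x + COMBINING DIAERESIS: composes to U+1E8D

console.log(codePointCount(xDotBelow));  // 2: a multi-code-point string, so it needs {}
console.log(codePointCount(xDiaeresis)); // 1: a single code point, so {} would be optional
```

If {x} is allowed for single code points, the author can wrap both in {} without checking which case applies.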
On Thu, Apr 8, 2021 at 1:41 PM Mathias Bynens wrote:
Two random thoughts:
1. I’m assuming we want to allow string values containing the | character, by escaping it as \|. Agree/disagree?
2. Do we want {x} (where x is a single code point) to be allowed or disallowed?
   - Argument in favor of allowing: that way, {…} can be used for any non-empty string.
   - Argument in favor of disallowing: {x} is just a different way of writing x, so we don’t need to allow it.
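The escaping question can be made concrete with a small parser for the text between { and }. This is only a sketch of the rules discussed in this thread (| separates strings; a backslash makes the next character literal, covering \|, \\ and \}), not the proposal's grammar. The function name parseStringLiteral is hypothetical.

```javascript
// Split the text between { and } into its alternative strings.
// Assumed rules (taken from this thread, not from a spec):
//   "|" separates strings; "\" makes the following character literal.
// parseStringLiteral is a hypothetical name.
function parseStringLiteral(body) {
  const strings = [];
  let current = '';
  const chars = [...body]; // iterate by code point, not UTF-16 unit
  for (let i = 0; i < chars.length; i++) {
    const ch = chars[i];
    if (ch === '\\') {
      if (i + 1 >= chars.length) throw new SyntaxError('dangling backslash');
      current += chars[++i]; // take the escaped character literally
    } else if (ch === '|') {
      strings.push(current); // separator: close the current string
      current = '';
    } else if (ch === '}') {
      throw new SyntaxError('unescaped } inside string literal');
    } else {
      current += ch;
    }
  }
  strings.push(current);
  return strings;
}

console.log(parseStringLiteral('ch|a\\|b|x')); // [ 'ch', 'a|b', 'x' ]
```

Note that this sketch accepts a single code point such as x, which corresponds to the "allow {x}" option in point 2.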
|
Reasons for using {} as the syntax characters inside of brackets:
1. They are quite uncommon as literals, whereas () are much more likely to be needed as literals.
2. "because when I read {}, I expect to find a modifier that tells me what the {} means." In other words, people expect {} to be syntax characters, and are likely to pause and consider what they would mean. They are less likely to do that with ().
Mark
On Thu, Apr 8, 2021 at 1:46 PM Shane F. Carr wrote:
I've expressed this before, but I have a mild preference for either
\q{abc} or (abc).
[a-zA-Z\q{ch}\q{m̀}\q{か゚}\q{🇦🇺}\q{🇧🇪}\q{🇫🇷}]
[a-zA-Z(ch|m̀|か゚|🇦🇺|🇧🇪|🇫🇷)]
I'm not a fan of bare {} because when I read {}, I expect to find a
modifier that tells me what the {} means. If there is a regular
expression like /[p{letter}]/v, it looks like I am looking up a Unicode
property "letter", but actually I am just matching the alternation "p" and
"letter".
I am okay with () because that syntax is already widely understood as
being an alternation of strings.
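The visual similarity described above can already be seen in existing regular expressions with the u flag, where braces inside a character class are plain literal characters. The following sketch shows today's behavior only; it does not use or assume any of the syntaxes proposed in this thread.

```javascript
// Existing behavior with the u flag: inside [...], { and } are literals.
const lookalike = /^[p{letter}]$/u; // matches one of: p { l e t r }
const property = /^[\p{Letter}]$/u; // Unicode property escape (General_Category=Letter)

console.log(lookalike.test('{'));      // true: the brace is just a class member
console.log(lookalike.test('\u00e9')); // false: é is not in the literal set
console.log(property.test('\u00e9'));  // true: é is a letter
console.log(property.test('{'));       // false: { is not a letter
```

The two patterns differ by a single backslash, which is the readability concern raised against bare {}.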
|
That's just not true. In ECMAScript Regex,
Do you have evidence for that claim? I don't think it's any more or less likely to write a regular expression that tries to match |
Yes, absolutely. Character escapes should work inside
Yes, absolutely. And if ES regex allows an empty string in an alternation -- |
Another argument for using
Now I decide that I want to perform a set operation on the alternation. With
With any of the alternatives proposed, I would need to re-write my string alternation to some other syntax that works inside |
As for braces vs. parentheses: AFAIK the only comparable existing practice is in ICU/CLDR (just braces) and in UTS #18 (braces with the \q prefix). In any case I want to lobby for multiple strings with a separator. If we were to go with the prefix, that would be
We settled on using parentheses, since it most closely matches the existing syntax for alternating between strings outside of character classes. One difference is that |
We have advanced to stage 2 with string literals like |
Another difference is that it matches left-to-right outside, longest-to-shortest inside. Let's look at some syntax precedent. If you want to have string literal syntax that looks the same outside and inside, make it behave the same outside and inside. For that I propose modifier
And of course you could use this |
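The ordering difference mentioned in this comment can be sketched with ordinary regular expressions. Outside a character class, alternatives are tried left to right; the longest-first behavior described for strings inside a class is emulated here by sorting the alternatives by length. The helper longestFirst is hypothetical and assumes its inputs contain no regex metacharacters.

```javascript
// Outside a class, alternatives are tried left to right.
const leftToRight = /(?:a|ab)/;
console.log(leftToRight.exec('ab')[0]); // 'a'

// Emulate longest-first matching by sorting alternatives by length.
// longestFirst is a hypothetical helper; inputs are assumed to contain
// no regex metacharacters, so no escaping is done here.
function longestFirst(strings) {
  const sorted = [...strings].sort((x, y) => y.length - x.length);
  return new RegExp('(?:' + sorted.join('|') + ')');
}

console.log(longestFirst(['a', 'ab']).exec('ab')[0]); // 'ab'
```

The same list of strings therefore gives different matches depending on which ordering rule applies.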
|
Quite the opposite. If I read the temporary grammar correctly, ClassStrings limit what can appear inside |
Update: After much discussion on issues #33 and #46 we have decided to go back to the UTS #18 string literal syntax, \q{…}.
During the April 2021 TC39 Incubator call, we received feedback that the addition of string literal syntax is necessary, to enable things like “match all RGI_Emoji except for the Belgian flag”.
We tried to leave it out of the proposal before in an attempt to keep it minimal, but now it’s become clear that it’s crucial functionality that should be included from the start.
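To make the "all RGI_Emoji except for the Belgian flag" use case concrete, here is a sketch that models a set of strings as a JavaScript Set, subtracts one string, and compiles the result to an ordinary alternation. It deliberately avoids any proposed syntax. The three-flag set is a stand-in for RGI_Emoji, and the helper names escapeRegExp, difference and compile are hypothetical.

```javascript
// Escape regex syntax characters so each string is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Set difference on sets of strings.
function difference(a, b) {
  return new Set([...a].filter((s) => !b.has(s)));
}

// Compile a set of strings to an anchored alternation, longest first.
function compile(set) {
  const alts = [...set].sort((x, y) => y.length - x.length).map(escapeRegExp);
  return new RegExp('^(?:' + alts.join('|') + ')$', 'u');
}

const AU = '\u{1F1E6}\u{1F1FA}'; // flag: Australia
const BE = '\u{1F1E7}\u{1F1EA}'; // flag: Belgium
const FR = '\u{1F1EB}\u{1F1F7}'; // flag: France

// Stand-in for RGI_Emoji (assumption: the real set is far larger).
const flags = new Set([AU, BE, FR]);
const withoutBelgium = compile(difference(flags, new Set([BE])));

console.log(withoutBelgium.test(FR)); // true
console.log(withoutBelgium.test(BE)); // false
```

Each flag is a sequence of two code points, which is why the subtraction cannot be expressed with single-code-point class members alone.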
What should the syntax be? UTS18 has \q{…}: https://unicode.org/reports/tr18/#Character_Ranges_with_Strings But we also considered not having a prefix at all, and instead just using {…}. Thoughts? Opinions?