String composition syntax #46

sffc · 2021-09-23T17:37:17Z

We have established that sets of strings need to be bounded.

Can we introduce helper syntax to make large-but-bounded sets of strings more friendly to write?

For example:

Question mark: /[(ab?c)]/ == /[(ac|abc)]/
Braces with a minimum and maximum: /[(ab{1,3}c)]/ == /[(abc|abbc|abbbc)]/
Prefixing or suffixing of nested sets: /[(a[bc]d)]/ == /[(abd|acd)]/
Prefixing or suffixing of nested properties: /[(a\p{L}d)]/ == the set of all letters prefixed with "a" and suffixed with "d"
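
One way to read the question-mark and nested-set equivalences above is as a Cartesian product over per-position alternatives. The following is a minimal Python sketch of that reading; the `expand` helper and the list-of-alternatives representation are illustrative assumptions, not proposed syntax or spec text.

```python
from itertools import product

def expand(parts):
    """Concatenate one choice from each position, in every combination."""
    return ["".join(choice) for choice in product(*parts)]

# ab?c: the optional "b" contributes either "" or "b".
print(expand([["a"], ["", "b"], ["c"]]))    # ['ac', 'abc']

# a[bc]d: the nested set contributes one alternative per member.
print(expand([["a"], ["b", "c"], ["d"]]))   # ['abd', 'acd']
```

The size of the result is the product of the number of alternatives at each position, which is why the set stays bounded as long as every position is bounded.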

I can think of a number of use cases:

Emoji sequences where the skin-tone code point is optional
Alternate spellings of words
Regex compression


markusicu · 2021-09-23T18:34:46Z

My initial reaction:

1. Oh no, not another complication!
2. Something like this might be ok if it's pretty explicit; I don't like putting a property in there like "all letters" which can be large and which will grow over time.
3. For CLDR, Mark has cooked up something similar, a concise syntax for "ranges" of strings, something like abcd~df=abcd|abce|abcf|abdd|abde|abdf. Used to compress language subtag validity data.
4. I don't want this to delay our proposal, but if we might want to do something like this, then we would need to at least require escaping of a lot of potential syntax characters inside string literals.
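
Going only by the single CLDR example quoted above, the range can be read as follows: the text after `~` replaces the tail of the start string to form the end string, and each replaced position runs independently from its start character to its end character. Below is a rough Python sketch under that reading. The function name and the exact semantics are assumptions inferred from that one example, and the real CLDR definition may differ.

```python
from itertools import product

def expand_string_range(spec):
    """Expand 'abcd~df' under the assumed reading described above."""
    start, _, end_tail = spec.partition("~")
    if not end_tail or len(end_tail) > len(start):
        raise ValueError(f"unsupported range: {spec!r}")
    prefix = start[: len(start) - len(end_tail)]
    start_tail = start[len(prefix):]
    # Each tail position ranges over its own inclusive code point interval.
    columns = []
    for lo, hi in zip(start_tail, end_tail):
        if lo > hi:
            raise ValueError(f"descending range {lo!r}~{hi!r}")
        columns.append([chr(cp) for cp in range(ord(lo), ord(hi) + 1)])
    return [prefix + "".join(tail) for tail in product(*columns)]

print(expand_string_range("abcd~df"))
# ['abcd', 'abce', 'abcf', 'abdd', 'abde', 'abdf']
```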

RunDevelopment · 2021-09-23T18:37:27Z

I agree that this would be a really nice feature but implementing it will be difficult. E.g. /[([ab]{64})]/ will resolve into 2⁶⁴ strings. This (really nice) feature will probably take us further away from ICU's UnicodeSet which (to my knowledge) is implemented as a character set and a list of strings.

macchiati · 2021-09-23T18:43:41Z

My initial reaction is much like Markus's; let's not let this delay or derail the current proposal. Mark

waldemarhorwat · 2021-09-23T20:46:37Z

Are we going to allow general regexps inside character classes? That seems to be the simplest way of doing this. If we go that route:

We need to allow character classes to include infinite sets.
Things like backtracking evaluation order and capturing parentheses will get weird.

markusicu · 2021-09-23T21:08:43Z

> Are we going to allow general regexps inside character classes?

I really do not want to go there.
And that's not what Shane is suggesting here -- “We have established that sets of strings need to be bounded.” -- he is “only” suggesting ways to abbreviate a finite list of literal strings.

That is, if and when we do this, we would end up with an algorithm in the spec for how to expand strings-with-wildcards into a fixed set of strings, rather than turning the result of a character class into some sort of nested-regex matcher.

As Michael points out, depending on what wildcards are supported, this could easily yield an astronomical number of strings and thus eat a lot of memory, so we should think about security implications.
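
To make the "expand into a fixed set of strings" idea and the memory concern concrete, here is a toy Python sketch of an expander that counts the result before building it and refuses anything over a limit. The supported subset (literals, `[...]` classes with simple ranges, `?`, `{n}`, `{m,n}`), the `expand_bounded` name, and the default limit are all assumptions made for illustration; none of this is spec text.

```python
import re
from itertools import product

# One atom (a [...] class or a single character) plus an optional quantifier.
TOKEN = re.compile(r"(\[[^\]]+\]|.)(\?|\{(\d+)(?:,(\d+))?\})?", re.S)
CLASS_ITEM = re.compile(r"(.)-(.)|(.)", re.S)

def atom_choices(atom):
    """Return the characters an atom can match: a literal or a [...] class."""
    if len(atom) == 1:
        return [atom]
    chars = []
    for lo, hi, single in CLASS_ITEM.findall(atom[1:-1]):
        chars += [single] if single else [chr(c) for c in range(ord(lo), ord(hi) + 1)]
    return chars

def expand_bounded(pattern, limit=10_000):
    """Expand a toy string-with-wildcards pattern, refusing oversized results."""
    positions, total = [], 1
    for atom, quant, lo, hi in TOKEN.findall(pattern):
        chars = atom_choices(atom)
        if quant == "?":
            lo, hi = 0, 1
        elif quant:
            lo, hi = int(lo), int(hi or lo)
        else:
            lo, hi = 1, 1
        if lo > hi:
            raise ValueError(f"bad quantifier {quant!r}")
        # Count first, so the limit is enforced before any strings are built.
        total *= sum(len(chars) ** n for n in range(lo, hi + 1))
        if total > limit:
            raise ValueError(f"pattern expands to more than {limit} strings")
        positions.append(
            ["".join(p) for n in range(lo, hi + 1) for p in product(chars, repeat=n)]
        )
    # dict.fromkeys drops duplicates (e.g. from "a?a?") while keeping order.
    return list(dict.fromkeys("".join(parts) for parts in product(*positions)))

print(expand_bounded("ab?c"))     # ['ac', 'abc']
print(expand_bounded("a[bc]d"))   # ['abd', 'acd']
try:
    expand_bounded("[ab]{64}")    # 2**64 candidates: rejected without expanding
except ValueError as err:
    print(err)
```

The count is an upper bound because it ignores duplicates, so a limit check of this kind errs on the side of rejecting a pattern.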

RunDevelopment · 2021-09-23T21:54:59Z

> depending on what wildcards are supported, this could easily yield an astronomical number of strings

Almost any wildcard that resolves into >1 strings can be used to cause a combinatorial explosion. I don't think that there are any useful wildcards that can be implemented safely if they all get de-sugared into strings.

Examples:

Character class: /[([a-z][a-z][a-z][a-z])]/ accepts 26⁴ strings.
Character set: /[(\w\w\w\w)]/ accepts 63⁴ strings.
Single character set + suffix: /[(\Wa)]/ accepts >1M strings.
Quantifier: /[(a{1,100}b{1,100}c{1,100}d{1,100})]/ accepts 100⁴ strings.
Question mark quantifier: /[(0?1?2?3?4?5?6?7?8?9?_0?1?2?3?4?5?6?7?8?9?)]/ accepts 2²⁰ strings.
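
The counts in the examples above can be checked with plain arithmetic, without expanding anything. This short Python sketch assumes `\w` means the 63 characters A-Z, a-z, 0-9 and `_`, and that `\W` ranges over every other code point from U+0000 to U+10FFFF; both are assumptions about the matching mode.

```python
WORD = 26 + 26 + 10 + 1       # \w: A-Z, a-z, 0-9, and "_"
NON_WORD = 0x110000 - WORD    # \W: all remaining code points (assumed)

counts = {
    "[a-z][a-z][a-z][a-z]": 26 ** 4,                        # 456,976
    r"\w\w\w\w": WORD ** 4,                                 # 15,752,961
    r"\Wa": NON_WORD,                                       # 1,114,049
    "a{1,100}b{1,100}c{1,100}d{1,100}": 100 ** 4,           # 100,000,000
    "0?1?2?3?4?5?6?7?8?9?_0?1?2?3?4?5?6?7?8?9?": 2 ** 20,   # 1,048,576
}
for pattern, n in counts.items():
    print(f"{pattern}: {n:,}")
```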

mathiasbynens · 2021-09-24T06:58:48Z

+1 to exploring this further as a separate follow-up proposal.

We don’t need to do anything special as part of this proposal since \X (where X is an ASCII letter that currently doesn’t have a special escape sequence) is already reserved in the current upstream spec in u mode (and will also be in v mode). (We made sure of that here: https://web.archive.org/web/20141214085510/https://bugs.ecmascript.org/show_bug.cgi?id=3157)

If after further investigation we decide to add this functionality, we could handle it by adding a new prefix alongside \q{…} (for simple strings).

markusicu · 2021-09-30T22:43:52Z

Discussed today with Markus, Mathias, Richard, Mark, Bradley, Shane.
We decided to not pursue these ideas in this proposal.
A future proposal could introduce string-range/abbreviation syntax using either parentheses or a backslash-with-new-letter combination different from \q{...}.

Issue: #33 #46

sffc mentioned this issue Sep 23, 2021

More syntax characters should be forbidden in ClassSyntaxCharacter #33

Closed

markusicu mentioned this issue Sep 30, 2021

Add syntax for string literals #17

Closed

markusicu closed this as completed Sep 30, 2021

mathiasbynens pushed a commit that referenced this issue Oct 1, 2021

Revert back to backslash-q string literals (#47)

3a4c142

Issue: #33 #46
