This repository has been archived by the owner on Feb 16, 2024. It is now read-only.

String composition syntax #46

Closed
sffc opened this issue Sep 23, 2021 · 8 comments

@sffc
Collaborator

sffc commented Sep 23, 2021

We have established that sets of strings need to be bounded.

Can we introduce helper syntax to make large-but-bounded sets of strings more friendly to write?

For example:

  1. Question mark: /[(ab?c)]/ == /[(ac|abc)]/
  2. Braces with a minimum and maximum: /[(ab{1,3}c)]/ == /[(abc|abbc|abbbc)]/
  3. Prefixing or suffixing of nested sets: /[(a[bc]d)]/ == /[(abd|acd)]/
  4. Prefixing or suffixing of nested properties: /[(a\p{L}d)]/ == the set of all letters prefixed with "a" and suffixed with "d"
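
A minimal sketch of how the first three shorthands could expand into a finite set. It assumes the pattern has already been parsed into a sequence of positions, each with a list of alternatives, and that braces follow the usual regex meaning, where {1,3} requires at least one repetition. The `expand` helper and this representation are hypothetical and not part of any proposal; Python is used only for illustration. The fourth shorthand is left out because the result depends on which code points \p{L} covers.

```python
from itertools import product

def expand(pieces):
    """Return every concatenation that picks one alternative per position."""
    return {"".join(combo) for combo in product(*pieces)}

# ab?c: the optional "b" is either absent or present.
optional = expand([["a"], ["", "b"], ["c"]])

# ab{1,3}c: one to three copies of "b".
bounded = expand([["a"], ["b" * n for n in range(1, 4)], ["c"]])

# a[bc]d: the nested set contributes one alternative per member.
nested = expand([["a"], ["b", "c"], ["d"]])

print(sorted(optional))  # ['abc', 'ac']
print(sorted(bounded))   # ['abbbc', 'abbc', 'abc']
print(sorted(nested))    # ['abd', 'acd']
```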

I can think of a number of use cases:

  • Emoji sequences where the skin-tone code point is optional
  • Alternate spellings of words
  • Regex compression
@markusicu
Collaborator

My initial reaction:

  1. Oh no, not another complication!
  2. Something like this might be ok if it's pretty explicit; I don't like putting a property in there like "all letters" which can be large and which will grow over time.
  3. For CLDR, Mark has cooked up something similar, a concise syntax for "ranges" of strings, something like abcd~df=abcd|abce|abcf|abdd|abde|abdf. Used to compress language subtag validity data.
  4. I don't want this to delay our proposal, but if we might want to do something like this, then we would need to at least require escaping of a lot of potential syntax characters inside string literals.

@RunDevelopment

RunDevelopment commented Sep 23, 2021

I agree that this would be a really nice feature, but implementing it will be difficult. E.g. /[([ab]{64})]/ will resolve into 2^64 strings. This (really nice) feature will probably take us further away from ICU's UnicodeSet which (to my knowledge) is implemented as a character set and a list of strings.

@macchiati
Collaborator

macchiati commented Sep 23, 2021 via email

@waldemarhorwat

Are we going to allow general regexps inside character classes? That seems to be the simplest way of doing this. If we go that route:

  • We need to allow character classes to include infinite sets.
  • Things like backtracking evaluation order and capturing parentheses will get weird.

@markusicu
Collaborator

“Are we going to allow general regexps inside character classes?”

I really do not want to go there.
And that's not what Shane is suggesting here -- “We have established that sets of strings need to be bounded.” -- he is “only” suggesting ways to abbreviate a finite list of literal strings.

That is, if and when we do this, we would end up with an algorithm in the spec for how to expand strings-with-wildcards into a fixed set of strings, rather than turning the result of a character class into some sort of nested-regex matcher.

As Michael points out, depending on what wildcards are supported, this could easily yield an astronomical number of strings and thus eat a lot of memory, so we should think about security implications.
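
One way to picture an expansion step that guards against this kind of blow-up is to compute the number of combinations before materializing any strings and to reject patterns above a cap. This is only a sketch under assumptions: the pre-parsed list-of-alternatives input, the `expand_bounded` and `TooManyStrings` names, and the limit of 10,000 are all hypothetical and were not discussed in this thread.

```python
from itertools import product
from math import prod

class TooManyStrings(ValueError):
    """Raised when a pattern would expand past the configured limit."""

def expand_bounded(pieces, limit=10_000):
    """Expand alternatives into strings, refusing patterns that exceed `limit`."""
    # The product of the alternative counts is an upper bound on the result size,
    # and it is cheap to compute without building any strings.
    total = prod(len(alts) for alts in pieces)
    if total > limit:
        raise TooManyStrings(f"pattern would expand to {total} strings (limit {limit})")
    return {"".join(combo) for combo in product(*pieces)}

print(sorted(expand_bounded([["a"], ["", "b"], ["c"]])))  # ['abc', 'ac']

try:
    expand_bounded([["a", "b"]] * 64)  # models /[([ab]{64})]/
except TooManyStrings as err:
    print("rejected:", err)
```

The check costs one multiplication per position, so it stays cheap even when the expansion itself would not fit in memory.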

@RunDevelopment

“depending on what wildcards are supported, this could easily yield an astronomical number of strings”

Almost any wildcard that resolves into more than one string can be used to cause a combinatorial explosion. I don't think that there are any useful wildcards that can be implemented safely if they all get de-sugared into strings.

Examples:

  • Character class: /[([a-z][a-z][a-z][a-z])]/ accepts 26^4 strings.
  • Character set: /[(\w\w\w\w)]/ accepts 63^4 strings.
  • Single character set + suffix: /[(\Wa)]/ accepts >1M strings.
  • Quantifier: /[(a{1,100}b{1,100}c{1,100}d{1,100})]/ accepts 100^4 strings.
  • Question mark quantifier: /[(0?1?2?3?4?5?6?7?8?9?_0?1?2?3?4?5?6?7?8?9?)]/ accepts 2^20 strings.
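
For scale, the counts above can be worked out without expanding anything. This is plain arithmetic in Python, assuming \w covers the 63 ASCII word characters and, for the \W case, that \W ranges over every other Unicode code point; the dictionary keys are just labels.

```python
WORD_CHARS = 26 + 26 + 10 + 1     # a-z, A-Z, 0-9 and "_"
ALL_CODE_POINTS = 0x110000        # U+0000 through U+10FFFF

counts = {
    "[a-z] four times": 26 ** 4,
    r"\w four times": WORD_CHARS ** 4,
    r"\W followed by a": ALL_CODE_POINTS - WORD_CHARS,
    "four {1,100} quantifiers": 100 ** 4,
    "twenty optional digits": 2 ** 20,
}

for label, n in counts.items():
    print(f"{label}: {n:,}")
# [a-z] four times: 456,976
# \w four times: 15,752,961
# \W followed by a: 1,114,049
# four {1,100} quantifiers: 100,000,000
# twenty optional digits: 1,048,576
```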

@mathiasbynens
Member

+1 to exploring this further as a separate follow-up proposal.

We don’t need to do anything special as part of this proposal since \X (where X is an ASCII letter that currently doesn’t have a special escape sequence) is already reserved in the current upstream spec in u mode (and will also be in v mode). (We made sure of that here: https://web.archive.org/web/20141214085510/https://bugs.ecmascript.org/show_bug.cgi?id=3157)

If after further investigation we decide to add this functionality, we could handle it by adding a new prefix alongside \q{…} (for simple strings).

@markusicu
Collaborator

Discussed today with Markus, Mathias, Richard, Mark, Bradley, Shane.
We decided to not pursue these ideas in this proposal.
A future proposal could introduce string-range/abbreviation syntax using either parentheses or a backslash-with-new-letter combination different from \q{...}.
