Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disallow irregular patterns #15

Open
bergus opened this issue Aug 12, 2015 · 8 comments
Open

Disallow irregular patterns #15

bergus opened this issue Aug 12, 2015 · 8 comments
Assignees

Comments

@bergus
Copy link
Collaborator

bergus commented Aug 12, 2015

The current regex grammar is quite liberal about SyntaxCharacters that appear in places where they are not allowed. For example, /a{/ is just taken literally.

This complicates parsing, and deciding what context we are in to escape the interpolation value. As an example, what would RegExp.make a{${"1,}b"}do, is it equivalent to/a{1,}b/`?

We should ensure that only well-formed patterns (where interpolation "holes" can produce Atoms, DecimalDigits etc) are allowed as arguments to make, and everything else throw a SyntaxError.

@ljharb
Copy link
Collaborator

ljharb commented Aug 12, 2015

It'd be ideal if we didn't have to use try/catch to handle error cases :-/ one of the primary use cases is handling user input, and if it could throw (like JSON.parse) then our code has to be littered with try/catches, or wrapped in a Promise.

@bergus
Copy link
Collaborator Author

bergus commented Aug 12, 2015

@ljharb Nah, I meant that RegExp.make a{ ${input}`` always throws, regardless of the user input, because unmatched braces are a programmer mistake (and make the implementation of `make` harder).
Of course, `RegExp.make` would not throw on the interpolation values (as long as they have the expected `Type`, e.g. string/regex, and don't throw on coercion); all character sequences are valid and do not need to meet any pattern.

@ljharb
Copy link
Collaborator

ljharb commented Aug 12, 2015

ah, ok awesome - throws for programmer input are awesome :-)

@mikesamuel
Copy link
Owner

So throw on errors in literal sections per early errors but should never throw because of an interpolated value?

@bergus
Copy link
Collaborator Author

bergus commented Aug 12, 2015

Well, it wouldn't be exactly a compile-time early error (like they are possible for regex literals), as RegExp.make gets to see the raw strings not before its invocation, but yes, it would only throw these errors because of syntactic issues in the literal sections. Interpolation values are always escaped literally, and don't need to conform to any syntactic grammar.

should never throw because of an interpolated value?

OT, but a strict "never" is not possible of course. We will always need to throw in cases like

RegExp.make `x{${ -1 }}` // repeat -1 times
RegExp.male `^${ {toString(){ throw "buh!"; }} }` // better not coerce that

To precise: RegExp.make must never throw when interpolating a primitive string value in the context of an Atom (the original purpose of RegExp.escape).

@mikesamuel
Copy link
Owner

Fair enough.

So constructs that are problematic from regexp_match_web_reality:

  1. \c not followed by a control code
  2. \x not followed by 2 x hex and \u not followed by (4 x hex or { hex+ })
  3. { not followed by a proper range then a } or not interpolable to the same.
  4. [ not closed by a ]
  5. character groups like \w used in a charset range
  6. zero-width assertions like \B used in a charset.

Not on that list, but patterns that could indicate a disconnect between author and interpreter:

  1. (? followed by a character that has a special meaning in other regular expression systems like (?i:, (?i, (?<!
  2. unicode character classes \p, [:Lu:] in character classes.

@bergus
Copy link
Collaborator Author

bergus commented Aug 13, 2015

Oh, there's already a list of them? Yeah, it would be nice if we could deprecate them right from the beginning :-)
I'm specifically concerned about everything that would change the context for interpolation holes, like 3) for quantifiers or 4) for classes, but might not if it's not properly closed. Also the incomplete escape sequences are a hazard if directly followed by an interpolated value.

I think we will need to allow those patterns like (?i:, (?i, (?<! though, as they could be used by an extend regular grammar (of the engine) or a subclass (userland). RegExp.make doesn't need to deal with them; if not supported the RegExp constructor will throw on them anyway (or are the browsers where it doesn't?).
However, a pattern like …(?${input})… should be considered irregular indeed.

@mikesamuel
Copy link
Owner

Agreed on (?${input}).

To summarize:

We should fail on interpolations in some contexts.

  • \c${...}
  • \u${...}
  • \x${...}
  • (?${...})`

Re lookbehind, should we dissallow interplations in

  • (?<${...})

where the valence of the interpolation has not yet been specified.

What about in charsets on either side of a -?

  • [a-${...}]
  • [${...}-z]

@mikesamuel mikesamuel self-assigned this Oct 13, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants