ES2018 RegExp enhancements #673

mysticatea · 2018-02-10T14:52:14Z

Replaces RegExp validation with a dedicated validator. The validator implements the full regular expression grammar from A.8 Regular Expressions and B.1.4 Regular Expressions Patterns, and it generates the same error messages as the native RegExp implementation in Node.js. (I checked the validator's code coverage with nyc and brought it to almost 100%.)
Enhances the validator to support RegExp named capture groups (tc39/ecma262@95ec0c6)
Enhances the validator to support RegExp Unicode Property Escapes (tc39/ecma262@0ae3582)
Enhances the validator to support RegExp Lookbehind Assertions (tc39/ecma262@bf8a9be)
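
For readers unfamiliar with the three ES2018 features listed above, here is a minimal sketch of the syntax involved. It uses the runtime's native RegExp rather than the validator from this PR, and it assumes an engine that already supports ES2018 regular expressions (for example a recent Node.js).

```javascript
// Named capture groups: (?<name>...) exposes each match under `groups`.
const dateMatch = /(?<year>\d{4})-(?<month>\d{2})/.exec("2018-02")

// Unicode property escapes: \p{...} is only available with the `u` flag.
const isGreek = /^\p{Script=Greek}+$/u.test("αβγ")

// Lookbehind assertions: (?<=...) matches only if the text before matches.
const price = /(?<=\$)\d+/.exec("cost: $42")[0]

console.log(dateMatch.groups.year, dateMatch.groups.month, isGreek, price)
```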

I'd like to get advice on the direction.

marijnh · 2018-02-10T16:04:02Z

This is great. One thing I'm a bit wary of is the code weight. This is written in a very, err, software-engineery way. The RegExp grammar isn't all that complicated, and I figure if we cut down some of the abstraction (numeric literals rather than constants, etc) this could be at least two times smaller. If you want, I can take a stab at that.

Also, it may, though it makes little sense from a functionality perspective, be a good idea to make these parser methods instead of plain functions, so that plugins can add RegExp features.

mysticatea · 2018-02-11T05:50:41Z

Thank you.

I agree, it's not slim. There are two reasons.

One is that I can't keep ASCII codes in my head and I wanted input completion to help me. I wasn't worried about the constants, since minification tools can fold them. But I'm OK with removing them once development is done.
The other is that I intended to keep the validator's structure as close to the spec's structure as possible. I thought that would make future enhancements easier, since TC39 proposals are written as diffs against the current spec.

> Also, it may, though it makes little sense from a functionality perspective, be a good idea to make these parser methods instead of plain functions, so that plugins can add RegExp features.

I will move the functions into the validator.

adrianheine · 2018-02-11T19:13:18Z

https://github.com/jviereck/regjsparser and its tests might be a useful resource for this.

marijnh · 2018-02-13T11:11:06Z

> I will move the functions into the validator.

Plugins can not extend this class, though (I mean, I guess they sort of can, but not in a composable way). I was proposing for the methods that might be overridden to live on the Parser class.

mysticatea · 2018-02-13T11:22:24Z

I see. But I'm hesitant, since the validator has a ton of state and methods for validating RegExp patterns. I'm not sure merging them into Parser is better... For enhancement, is this.regexpValidator property not enough?

marijnh · 2018-02-13T11:37:36Z

> For enhancement, is this.regexpValidator property not enough?

Not really. You should still keep the regexp-related state in a separate object, but put the main parsing functions on Parser, passing them this state as an argument.
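
The split described here, state in its own object and parsing functions on the parser taking that state as an argument, could look roughly like the sketch below. All names (`RegExpState`, `SketchParser`, `eatDecimalDigits`, `eatBracedQuantifier`) are hypothetical and are not taken from the PR; only `lastIntValue` echoes a field mentioned in the commit messages. Each `eat` method records the parsed number in `state.lastIntValue`, and the caller validates it afterwards.

```javascript
// Hypothetical sketch, not acorn's actual code.
// All mutable regexp-related state lives in one object.
class RegExpState {
  constructor(source) {
    this.source = source
    this.pos = 0
    this.lastIntValue = 0
  }
  // Current code unit, or -1 at the end of the pattern.
  current() {
    return this.pos < this.source.length ? this.source.charCodeAt(this.pos) : -1
  }
  // Consume `ch` if it is the current code unit.
  eat(ch) {
    if (this.current() !== ch) return false
    this.pos++
    return true
  }
}

// Parsing methods live on the parser so a subclass can override them;
// the state is passed in explicitly.
class SketchParser {
  eatDecimalDigits(state) {
    const start = state.pos
    state.lastIntValue = 0
    let ch
    while ((ch = state.current()) >= 0x30 && ch <= 0x39) { // "0".."9"
      state.lastIntValue = 10 * state.lastIntValue + (ch - 0x30)
      state.pos++
    }
    return state.pos !== start
  }

  // Parses `{min}`, `{min,}` or `{min,max}` and checks min <= max.
  eatBracedQuantifier(state) {
    const start = state.pos
    if (state.eat(0x7B) && this.eatDecimalDigits(state)) { // "{"
      const min = state.lastIntValue
      let max = min
      if (state.eat(0x2C)) { // ","
        max = this.eatDecimalDigits(state) ? state.lastIntValue : Infinity
      }
      if (state.eat(0x7D)) { // "}"
        if (max < min) throw new SyntaxError("numbers out of order in {} quantifier")
        return true
      }
    }
    state.pos = start // not a quantifier; rewind
    return false
  }
}
```

Because the methods only touch the state they are handed, a plugin that subclasses the parser can replace a single method without having to know about any separate validator object.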

mysticatea · 2018-02-13T11:41:33Z

OK, I will try it.

mysticatea · 2018-02-13T13:13:20Z

I moved the ton of methods into Parser.prototype. To make it clear that those methods validate RegExp patterns, I added a validateRegExp_ prefix to them.

marijnh · 2018-02-13T16:15:55Z

src/regexp.js

+  }
+
+  // Node.js 0.12/0.10 don't support String.prototype.codePointAt().
+  codePointAt(i) {


Is there any case where it actually matters whether we have a single (non-syntax) character or two of them? Could we make do by just looking at code units not code points?

(If not, there may be a problem where this gets regexps without u flag wrong, since those should treat surrogate pairs as two characters. If yes, we could maybe drop the extra complexity around code points?)

mysticatea (reply, 2018-02-14): Thank you for the good catch. It was a problem. I fixed it and added tests.
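
To illustrate the distinction raised in this review thread: without the `u` flag a regular expression sees a surrogate pair as two characters, while with it the pair counts as one code point. The helper below is a hand-written sketch of a `codePointAt` fallback for runtimes lacking `String.prototype.codePointAt()`; it is not the code from `src/regexp.js`.

```javascript
// Decode the code point starting at index i using only charCodeAt().
function codePointAt(s, i) {
  const lead = s.charCodeAt(i)
  // Not a lead surrogate, or nothing follows it: return the code unit.
  if (lead < 0xD800 || lead > 0xDBFF || i + 1 >= s.length) return lead
  const trail = s.charCodeAt(i + 1)
  // Lone lead surrogate: return it unchanged.
  if (trail < 0xDC00 || trail > 0xDFFF) return lead
  return ((lead - 0xD800) << 10) + (trail - 0xDC00) + 0x10000
}

const smiley = "\uD83D\uDE00" // U+1F600 encoded as a surrogate pair

// Without `u`, `.` matches a single code unit, so the pair is two characters.
const matchesWithoutU = /^.$/.test(smiley)
// With `u`, `.` matches a whole code point.
const matchesWithU = /^.$/u.test(smiley)

console.log(codePointAt(smiley, 0).toString(16), matchesWithoutU, matchesWithU)
```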

ghost · 2018-02-14T06:22:36Z

I noticed this code isn't bulletproof and is missing some validation. You can parse /([a ]\b)*\b/ and it works, but it is invalid syntax (invalid whitespace in this case). There are other cases as well, for example you need to check for opening brackets, curly braces and parens. These will parse without an issue even if they are wrong.

mysticatea · 2018-02-14T06:53:05Z

@jonnymanok Thank you for the comment. I added a test for /([a ]\b)*\b/. It's valid syntax, the same as in V8.

adrianheine · 2018-02-22T10:00:24Z

Thanks for implementing this, @mysticatea, and thanks for reviewing, @marijnh! Running the regjsparser test suite on acorn, I found four failures:

1. /[\B]/ should not parse, but does
2. Three test cases test something like ((([a]+)?/?>?)?). I don't understand why that should pass. @mathiasbynens, could you give me a hint?

mathiasbynens · 2018-02-22T14:42:53Z

Re: 2, that looks like a mistake to me. cc @jviereck

mysticatea · 2018-02-22T16:38:24Z

@adrianheine /[\B]/ should be parsed because \B is equivalent to B in Annex B. (https://www.ecma-international.org/ecma-262/8.0/#prod-annexB-IdentityEscape)
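
The Annex B behaviour can be checked against a host engine. The sketch below assumes an engine that implements Annex B (browsers and Node.js do); the helper name `compiles` is made up for illustration. Without the `u` flag, `\B` inside a character class is an identity escape for the letter B; with the `u` flag the same pattern is rejected.

```javascript
// Returns true if the engine accepts the pattern with the given flags.
function compiles(pattern, flags) {
  try {
    new RegExp(pattern, flags)
    return true
  } catch (e) {
    if (e instanceof SyntaxError) return false
    throw e
  }
}

const acceptedWithoutU = compiles("[\\B]", "")  // Annex B identity escape
const acceptedWithU = compiles("[\\B]", "u")    // no identity escape for B
const matchesLetterB = new RegExp("[\\B]").test("B")

console.log(acceptedWithoutU, acceptedWithU, matchesLetterB)
```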

jviereck · 2018-02-23T03:47:34Z

> Three test cases test something like ((([a]+)?/?>?)?). I don't understand why that should pass. @mathiasbynens, could you give me a hint?
> Re: 2, that looks like a mistake to me. cc @jviereck

I tried to figure out what you are referring to (is there a test in regjsparser that tests for this sequence?) but couldn't find it or figure it out. Happy to take a look once I know what you refer to. Thanks :)

adrianheine · 2018-02-23T09:16:57Z

@jviereck line 1254 in test-data.json for example

@mysticatea Right, regjsparser doesn't follow Annex B, I actually discussed that in jviereck/regjsparser#90 but forgot about it again.

mysticatea · 2018-02-23T09:25:38Z

About ((([a]+)?/?>?)?): the A.8 Regular Expressions syntax allows it, so new RegExp("((([a]+)?/?>?)?)") is valid. However, the RegularExpressionLiteral production stops at the first / in the pattern, and the remaining ((([a]+)? doesn't match the A.8 Regular Expressions grammar, so /((([a]+)?/?>?)?)/ is a syntax error.
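
The difference between the constructor and the literal form can be reproduced in any engine. This is a small sketch; the helper `literalParses` is a made-up name, and it uses `new Function` only to catch the early SyntaxError at run time.

```javascript
const pattern = "((([a]+)?/?>?)?)"

// As a constructor argument the pattern is valid: "/" is an ordinary character.
const viaConstructor = new RegExp(pattern)

// Returns true if `source` parses as a JavaScript expression.
function literalParses(source) {
  try {
    new Function("return " + source)
    return true
  } catch (e) {
    if (e instanceof SyntaxError) return false
    throw e
  }
}

// As a literal, the body ends at the first unescaped "/" outside a class,
// leaving the unbalanced pattern ((([a]+)? behind.
const unescapedLiteral = literalParses("/" + pattern + "/")
// Escaping the slash keeps the whole pattern inside the literal.
const escapedLiteral = literalParses("/" + pattern.replace(/\//g, "\\/") + "/")

console.log(viaConstructor.exec("aa/>")[0], unescapedLiteral, escapedLiteral)
```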

jviereck · 2018-02-25T20:43:03Z

> Three test cases test something like ((([a]+)?/?>?)?). I don't understand why that should pass. @mathiasbynens, could you give me a hint?

When I implemented regexp.js (which regjsparser emerged from) and imported the test262 tests, I recall there were some RegExps matching XML that were failing. From reading the spec it looked like these test262 tests should fail, but they did not, and browsers were happy with them too. So, as I recall, I adjusted the parser to make things work.

mysticatea · 2018-02-26T04:34:22Z

@marijnh Do you have a plan to publish a new version?

marijnh · 2018-02-26T08:53:27Z

Do you have time to look into the performance regression reported here first? That seems a lot for just parsing regexps, which should be only a small fraction of the source code by size—maybe if you profile something obvious will stand out.

ghost · 2018-02-26T11:34:09Z

@marijnh I did some research regarding the performance loss and found out that the performance loss is only existing for libraries that rely on heavy use of regular expressions such as jQuery and Angular. So there is no perf loss for libs that use a small number of regexps, and parsing a single regexp or two does not cause any perf loss either. Hope that helps :)

I may think the issue or the bottle neck is where you create a new regexp prototype instance when validation regular expressions. Maybe try to cache this part somehow. Invoking the new keyword frequently can and will create perf loss. V8, as far as I know, optimizes for this differently.

marijnh · 2018-02-26T11:52:56Z

> the performance loss is only existing for libraries that rely on heavy use of regular expressions such as jQuery and Angular

Looking through the source code for jquery I don't see all that many regular expressions, so I don't understand why validating them would produce such a noticeable slowdown.

> I may think the issue or the bottle neck is where you create a new regexp prototype instance when validation regular expressions

This was done before the recent patches as well, and is required by the ESTree spec, so that's not the source of the regression and not something we can change.

ghost · 2018-02-26T13:01:38Z

@marijnh I looked into the regexp validation code. I'm not sure how to reproduce it, but you can use console.log. It looks like the code for indexing the chars runs multiple times even when it shouldn't, and a "length check" is used to return early if there is nothing to be done. I noticed the same code running in several places even for a single regexp.

I also noticed that, at the top of the script, it reparses some regular expressions. As far as I understand, this code was modeled after V8, so I did a comparison, and there is no reparsing there. I'm not sure whether this is enough to cause a perf regression.

It also seems (I'm not 100% sure) that there is some duplicated validation code when parsing Unicode, e.g. surrogate pairs.

mysticatea · 2018-02-27T02:42:27Z

In my env (Node 8.9.3, Windows 10), performance impact looks small.

"use strict"

const fs = require("fs")
const {performance} = require("perf_hooks")
const {parse} = require("./dist/acorn")
const code = fs.readFileSync("jquery-3.3.1.js", "utf8")

const times = []
for (let i = 0; i < 0x2FF; ++i) {
  const t0 = performance.now()
  parse(code, {ecmaVersion: 8})
  const t1 = performance.now()

  process.stdout.write(".")
  times.push(t1 - t0)
}
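// Sort numerically; the default comparator would compare the timings as strings.
times.sort((a, b) => a - b)
// The value printed under the "MEAN" label is the middle element, i.e. the median.
console.log("\nMEAN:", times[times.length / 2 | 0], "ms")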

Before: MEAN: 20.827545000240207 ms (ba939ea)
After: MEAN: 21.117823000997305 ms (0d20f67) (+1.4%)
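
The figure printed by the script is the middle element of the sorted timings, which is a median rather than an arithmetic mean. A sketch of that calculation as a standalone helper (the name `median` is an assumption), with an explicit numeric comparator because `Array.prototype.sort` compares elements as strings by default:

```javascript
// Middle element of the numerically sorted samples
// (the upper middle one when the count is even).
function median(samples) {
  const sorted = samples.slice().sort((a, b) => a - b)
  return sorted[sorted.length / 2 | 0]
}

// Default sort is lexicographic: "10" sorts before "9".
const lexicographic = [10, 9, 1].sort()
const middle = median([10, 9, 1])

console.log(lexicographic, middle)
```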

ghost · 2018-02-27T03:05:15Z

@mysticatea you are only testing jquery. Run acorn's own benchmark and you will notice it. Also try benchmarking libraries such as angular 1.6, jquery mobile, ts etc. The larger the libs are and the more regular expressions they contain, the bigger the perf drop compared to the current acorn version.

Note that the acorn benchmark runs in the browser, but I also tested it against nodejs 4.x, 8.x and latest.

mysticatea · 2018-02-27T03:21:37Z

I also looked at node --prof profiling output, but I couldn't find any evidence that the methods of this validator stand out: https://gist.github.com/mysticatea/57e34ba9be9b7a7676b583febf25d309

@jonnymanok Where is the benchmark script? I'm looking for npm run bench or something like that.

mysticatea · 2018-02-27T03:39:21Z

I tried with jquery and jquery-mobile:

Before: MEAN: 80.36216400004923 ms (ba939ea)
After: MEAN: 80.5539809986949 ms (0d20f67) (+0.2%)

"use strict"

const fs = require("fs")
const {performance} = require("perf_hooks")
const {parse} = require("../dist/acorn")
const code = [
  fs.readFileSync("jquery-3.3.1.js", "utf8"),
  fs.readFileSync("jquery.mobile-git.js", "utf8")
].join("\n")

const times = []
for (let i = 0; i < 0x3FF; ++i) {
  const t0 = performance.now()
  parse(code, {ecmaVersion: 8})
  const t1 = performance.now()

  process.stdout.write(".")
  times.push(t1 - t0)
}
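// Sort numerically; the default comparator would compare the timings as strings.
times.sort((a, b) => a - b)
// The value printed under the "MEAN" label is the middle element, i.e. the median.
console.log("\nMEAN:", times[times.length / 2 | 0], "ms")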

ghost · 2018-02-27T03:43:42Z

@mysticatea here

mysticatea · 2018-02-27T03:50:04Z

@jonnymanok Thanks. But how can I run it locally with a specific commit?

ghost · 2018-02-27T03:56:00Z

@mysticatea Download the source from this repo, compile it, and modify the source to run your local copy.

Here is my current result on my computer. Please note that the results differ from computer to computer. I only ran the default libs that already exist in the benchmark; I didn't add any others.

React.js is the worst candidate in this library collection :)

mysticatea · 2018-02-27T03:59:25Z

Ah, I find it in test/bench directory. Thanks.

ghost · 2018-02-27T04:07:51Z

NP. And here is a screenshot from another computer I ran this on. I compared Acorn dev against both Esprima and the cherow parser. I still notice a perf regression both in jquery and react library.

mysticatea · 2018-02-27T04:31:47Z

I tried it with all commits between the current master and 5.4.1.

mysticatea · 2018-02-27T04:38:03Z

mysticatea · 2018-02-27T04:42:51Z

mysticatea · 2018-02-27T04:44:20Z

I didn't find performance regression... 🤔

mysticatea · 2018-02-27T04:49:28Z

In my env, the dev seems faster than 5.4.1.
Mysterious.

ghost · 2018-02-27T05:14:03Z

I ran out of time on my end, but I notice you didn't find much regression. Try adding Angular 1.6, typescript and jquery mobile. Also note that the results differ from computer to computer.
I ran this on an i5 7th gen laptop with Windows 10, and an i3 4th gen desktop with Windows Vista.

KFlash · 2018-02-27T06:17:20Z

As far as I can see, nothing obvious in the current code causes a huge perf loss. There's room for optimization and performance tweaks, but that's it. I tested this against jslint. Line 423 contains a very long regexp you can validate against.

marijnh · 2018-02-27T07:46:40Z

Okay, that sounds like there's no real problem. I've released 5.5.0

jdalton · 2018-02-28T06:22:24Z

Is there a way to disable this and defer to the given engine?

Update:

Maybe nooping pp.validateRegExpFlags and pp.validateRegExpPattern?
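
The no-op idea can be sketched as follows. This is not acorn's code: `StandInParser` is a made-up stand-in class, and only the two method names `validateRegExpFlags` and `validateRegExpPattern` come from this thread. Their real signatures are not shown here, so the argument lists below are assumptions.

```javascript
// Stand-in parser with its own (deliberately strict) regexp validation.
class StandInParser {
  validateRegExpFlags(flags) {
    if (/[^gimsuy]/.test(flags)) throw new SyntaxError("Invalid regular expression flag")
  }
  validateRegExpPattern(pattern) {
    // Pretend this validator predates lookbehind support.
    if (pattern.includes("(?<=")) throw new SyntaxError("Invalid group")
  }
  parseRegExp(pattern, flags) {
    this.validateRegExpFlags(flags)
    this.validateRegExpPattern(pattern)
    try {
      return new RegExp(pattern, flags)
    } catch (e) {
      return null // the host engine rejected the pattern
    }
  }
}

// Overriding both validators with no-ops leaves the decision to the engine.
class EngineDeferringParser extends StandInParser {
  validateRegExpFlags() {}
  validateRegExpPattern() {}
}

const deferred = new EngineDeferringParser().parseRegExp("(?<=a)b", "")
console.log(deferred) // a RegExp if the engine supports lookbehind, else null
```

The trade-off is that the parser then accepts whatever the host engine accepts, so results can differ between runtimes.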

mysticatea force-pushed the regexp branch from 4327b35 to 996ab69 Compare February 10, 2018 14:58

replace RegExp validation

40ce6e9

mysticatea force-pushed the regexp branch from 996ab69 to 40ce6e9 Compare February 10, 2018 15:12

mysticatea added 3 commits February 11, 2018 15:42

refactor validator

9170ac7

rename this.pattern → this.source

7b04440

add RegExp named capture groups

6fb13e0

mysticatea force-pushed the regexp branch from 5e39d68 to 6fb13e0 Compare February 11, 2018 13:03

mysticatea added 4 commits February 13, 2018 14:41

move parameters to fields

9fa439a

unicode → this.switchU namedGroups → this.switchN This makes easy to enhance the validator by plugins

add RegExp Unicode property escapes

f7d0ef8

refactor

ff7de09

This commit changes the approach validating values. Before, it has used `parseXxx(start, end)` methods after eating production. Now, each `eat` methods make `this.lastIntValue` while parsing, then it uses the `this.lastIntValue` to validate values.

add RegExp lookbehind assertions

eb75208

Merge remote-tracking branch 'upstream/master' into regexp

aaf6e9d

# Conflicts: # bin/run_test262.js

refactor: move methods to Parser.prototype

4164a1f

mysticatea changed the title ~~[WIP] ES2018 RegExp enhancements~~ ES2018 RegExp enhancements Feb 13, 2018

small fix

68fd78e

mysticatea force-pushed the regexp branch from 4434d9b to 68fd78e Compare February 13, 2018 13:20

marijnh reviewed Feb 13, 2018

use code unit without u flag

82fd89b

mysticatea force-pushed the regexp branch from 7a008b1 to 82fd89b Compare February 14, 2018 01:16

add a test

4bf6c11
