ES2018 RegExp enhancements #673

Closed
mysticatea wants to merge 19 commits

mysticatea
Contributor

@mysticatea mysticatea commented Feb 10, 2018

  • Replaces RegExp validation with our own validator. The validator covers the full regular expression syntax based on A.8 Regular Expressions and B.1.4 Regular Expressions Patterns. It generates the same error messages as Node.js's native RegExp implementation. (I checked the validator's code coverage with nyc and brought it to almost 100%.)
  • Enhances the validator to support RegExp named capture groups (tc39/ecma262@95ec0c6)
  • Enhances the validator to support RegExp Unicode Property Escapes (tc39/ecma262@0ae3582)
  • Enhances the validator to support RegExp Lookbehind Assertions (tc39/ecma262@bf8a9be)

I'd like to get advice on the direction.

@marijnh
Member

marijnh commented Feb 10, 2018

This is great. One thing I'm a bit wary of is the code weight. This is written in a very, err, software-engineery way. The RegExp grammar isn't all that complicated, and I figure if we cut down some of the abstraction (numeric literals rather than constants, etc) this could be at least two times smaller. If you want, I can take a stab at that.

Also, it may, though it makes little sense from a functionality perspective, be a good idea to make these parser methods instead of plain functions, so that plugins can add RegExp features.

@mysticatea
Contributor Author

mysticatea commented Feb 11, 2018

Thank you.

I agree, it's not slim. There are two reasons.

  • One is that I couldn't keep ASCII codes in my head and I wanted input completion to help me. I wasn't bothered by the constants since minification tools can fold them. But I'm OK with removing them after development.
  • The other is that I intended to keep the validator's structure as close to the spec's structure as possible. I thought that would make future enhancements easier, since TC39 proposals are written as diffs against the current spec.

Also, it may, though it makes little sense from a functionality perspective, be a good idea to make these parser methods instead of plain functions, so that plugins can add RegExp features.

I will move the functions into the validator.

@adrianheine
Member

adrianheine commented Feb 11, 2018

https://github.com/jviereck/regjsparser and its tests might be a useful resource for this.

unicode → this.switchU
namedGroups → this.switchN

This makes it easy to enhance the validator with plugins.
This commit changes the approach to validating values.
Before, it used `parseXxx(start, end)` methods after eating a
production.
Now, each `eat` method sets `this.lastIntValue` while parsing, and the
validator then uses `this.lastIntValue` to validate values.
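
The approach described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `PatternState`, `eatDecimalDigits` and `eatBracedQuantifier` are hypothetical names chosen for the example, and only `lastIntValue` is taken from the commit message. Each `eat` function records the numeric value it consumed, so the caller can validate it without re-reading the source by `start`/`end` offsets.

```javascript
// Hypothetical sketch of "eat and record the value" validation.
class PatternState {
  constructor(source) {
    this.source = source
    this.pos = 0
    this.lastIntValue = 0
  }
  // Code unit at the current position, or -1 at the end of the pattern.
  current() {
    return this.pos < this.source.length ? this.source.charCodeAt(this.pos) : -1
  }
  // Consume the given code unit if it is next.
  eat(ch) {
    if (this.current() === ch) { this.pos++; return true }
    return false
  }
}

// Consumes decimal digits and stores their numeric value in state.lastIntValue.
function eatDecimalDigits(state) {
  const start = state.pos
  state.lastIntValue = 0
  let ch
  while ((ch = state.current()) >= 0x30 /* 0 */ && ch <= 0x39 /* 9 */) {
    state.lastIntValue = 10 * state.lastIntValue + (ch - 0x30)
    state.pos++
  }
  return state.pos !== start
}

// Consumes `{min}`, `{min,}` or `{min,max}`, using lastIntValue to check the bounds.
function eatBracedQuantifier(state) {
  const start = state.pos
  if (state.eat(0x7B /* { */)) {
    let min = 0, max = -1
    if (eatDecimalDigits(state)) {
      min = state.lastIntValue
      if (state.eat(0x2C /* , */) && eatDecimalDigits(state)) {
        max = state.lastIntValue
      }
      if (state.eat(0x7D /* } */)) {
        if (max !== -1 && max < min) {
          throw new SyntaxError("numbers out of order in {} quantifier")
        }
        return true
      }
    }
    state.pos = start // not a quantifier: backtrack
  }
  return false
}
```

With this shape, `eatBracedQuantifier(new PatternState("{2,5}"))` succeeds, while `"{3,1}"` is rejected as soon as both bounds have been read.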
@marijnh
Member

marijnh commented Feb 13, 2018

I will move the functions into the validator.

Plugins can not extend this class, though (I mean, I guess they sort of can, but not in a composable way). I was proposing for the methods that might be overridden to live on the Parser class.

@mysticatea
Contributor Author

I see. But I'm hesitant, since the validator has a ton of state and methods to validate RegExp patterns. I'm not sure merging them is better... For enhancement, is the this.regexpValidator property not enough?

@marijnh
Member

marijnh commented Feb 13, 2018

For enhancement, is this.regexpValidator property not enough?

Not really. You should still keep the regexp-related state in a separate object, but put the main parsing functions on Parser, passing them this state as an argument.
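
A rough sketch of the layout suggested here, under stated assumptions: `MiniParser`, `RegExpState`, `validatePattern` and the toy grammar are hypothetical stand-ins, not acorn's classes or methods. The point is only the shape: state lives in its own object, the parsing methods live on the parser class and receive that state as an argument, and plugins written as class-to-class functions can then override individual methods and compose.

```javascript
// Hypothetical stand-ins; not acorn's actual classes.
class RegExpState {
  constructor(source) {
    this.source = source
    this.pos = 0
  }
}

class MiniParser {
  // Entry point: creates the state and drives the overridable methods.
  validatePattern(source) {
    const state = new RegExpState(source)
    while (state.pos < state.source.length) {
      if (!this.validateRegExp_eatAtom(state)) {
        throw new SyntaxError(`Invalid regular expression: unexpected '${state.source[state.pos]}'`)
      }
    }
  }
  // Toy base grammar: only ASCII letters are atoms.
  validateRegExp_eatAtom(state) {
    if (/[a-z]/i.test(state.source[state.pos])) { state.pos++; return true }
    return false
  }
}

// Two independent "plugins", each extending whatever class it is given.
const allowDigits = Base => class extends Base {
  validateRegExp_eatAtom(state) {
    if (/[0-9]/.test(state.source[state.pos])) { state.pos++; return true }
    return super.validateRegExp_eatAtom(state)
  }
}
const allowUnderscore = Base => class extends Base {
  validateRegExp_eatAtom(state) {
    if (state.source[state.pos] === "_") { state.pos++; return true }
    return super.validateRegExp_eatAtom(state)
  }
}

// Plugins compose because each one only wraps the method it cares about.
const Extended = allowUnderscore(allowDigits(MiniParser))
new Extended().validatePattern("a1_b") // accepted by the composed parser
```

If the methods lived on a separate validator class instead, each plugin would have to replace that whole class, and two plugins doing so would not combine cleanly.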

@mysticatea
Contributor Author

OK, I will try it.

@mysticatea mysticatea changed the title from "[WIP] ES2018 RegExp enhancements" to "ES2018 RegExp enhancements" Feb 13, 2018
@mysticatea
Contributor Author

I moved the methods into Parser.prototype. To make it clear that those methods validate RegExp patterns, I added a validateRegExp_ prefix to them.

src/regexp.js Outdated
}

// Node.js 0.12/0.10 don't support String.prototype.codePointAt().
codePointAt(i) {
Member

@marijnh marijnh Feb 13, 2018

Is there any case where it actually matters whether we have a single (non-syntax) character or two of them? Could we make do by just looking at code units not code points?

(If not, there may be a problem where this gets regexps without u flag wrong, since those should treat surrogate pairs as two characters. If yes, we could maybe drop the extra complexity around code points?)

Contributor Author

Thank you for the good catch. It was a problem. I fixed it and added tests.
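
The distinction raised in this review thread can be illustrated with a short example. The engine behaviour shown is standard; `charAt` is a hypothetical helper written for this sketch, not the PR's `codePointAt`. It reads a full code point only when the `u` flag is in effect and a single UTF-16 code unit otherwise.

```javascript
const smiley = "\u{1F600}" // one code point, two UTF-16 code units

// Without the u flag a surrogate pair counts as two characters.
const matchesWithoutU = /^.$/.test(smiley)
// With the u flag the pair is treated as one character.
const matchesWithU = /^.$/u.test(smiley)

// Hypothetical helper: next "character" of a pattern, depending on the u flag.
function charAt(source, i, unicode) {
  const c = source.charCodeAt(i)
  // Non-unicode mode, or not a lead surrogate, or nothing follows: one code unit.
  if (!unicode || c < 0xD800 || c > 0xDBFF || i + 1 >= source.length) return c
  const next = source.charCodeAt(i + 1)
  // A lead surrogate without a trail surrogate stays a lone code unit.
  if (next < 0xDC00 || next > 0xDFFF) return c
  return ((c - 0xD800) << 10) + (next - 0xDC00) + 0x10000
}

console.log(matchesWithoutU, matchesWithU)            // false true
console.log(charAt(smiley, 0, false).toString(16))    // d83d
console.log(charAt(smiley, 0, true).toString(16))     // 1f600
```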

@ghost

ghost commented Feb 14, 2018

I noticed this code isn't bulletproof and is missing some validation. You can parse /([a ]\b)*\b/ and it works, but the syntax is invalid (invalid whitespace in this case). There are other cases as well, e.g. you need to check for opening brackets, braces, and parens. These parse without an issue even when they are wrong.

@mysticatea
Contributor Author

mysticatea commented Feb 14, 2018

@jonnymanok Thank you for the comment. I added a test for /([a ]\b)*\b/. It's valid syntax, the same as in V8.

@adrianheine
Member

Thanks for implementing this, @mysticatea, and thanks for reviewing, @marijnh! Running the regjsparser test suite on acorn, I found four failures:

  1. /[\B]/ should not parse, but does
  2. Three test cases test something like ((([a]+)?/?>?)?). I don't understand why that should pass. @mathiasbynens, could you give me a hint?

@mathiasbynens
Contributor

mathiasbynens commented Feb 22, 2018

Re: 2, that looks like a mistake to me. cc @jviereck

@mysticatea
Contributor Author

@adrianheine /[\B]/ should be parsed because \B is equivalent to B in Annex B. (https://www.ecma-international.org/ecma-262/8.0/#prod-annexB-IdentityEscape)
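
A quick way to check this against a JavaScript engine, as a sketch: `compiles` is a hypothetical helper written for this example. Without the `u` flag the Annex B grammar applies and `\B` inside a character class is an identity escape for `B`; with the `u` flag the stricter grammar applies and the same pattern is rejected.

```javascript
// Hypothetical helper: does the engine accept this pattern with these flags?
function compiles(source, flags) {
  try {
    new RegExp(source, flags)
    return true
  } catch (e) {
    if (e instanceof SyntaxError) return false
    throw e
  }
}

const legacyAccepts = compiles("[\\B]", "")   // Annex B grammar
const unicodeAccepts = compiles("[\\B]", "u") // no identity escape for B here
const matchesB = new RegExp("[\\B]").test("B")

console.log(legacyAccepts, unicodeAccepts, matchesB) // true false true
```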

@jviereck

Three test cases test something like ((([a]+)?/?>?)?). I don't understand why that should pass. @mathiasbynens, could you give me a hint?
Re: 2, that looks like a mistake to me. cc @jviereck

I tried to figure out what you are referring to (is there a test in regjsparser that tests for this sequence?) but couldn't find it. Happy to take a look once I know what you are referring to. Thanks :)

@adrianheine
Member

@jviereck line 1254 in test-data.json for example

@mysticatea Right, regjsparser doesn't follow Annex B, I actually discussed that in jviereck/regjsparser#90 but forgot about it again.

@mysticatea
Contributor Author

mysticatea commented Feb 23, 2018

About ((([a]+)?/?>?)?), the A.8 Regular Expressions syntax allows it, so new RegExp("((([a]+)?/?>?)?)") is valid. However, the RegularExpressionLiteral production stops at the / in the pattern, and ((([a]+)? doesn't match the A.8 Regular Expressions grammar, so /((([a]+)?/?>?)?)/ is a syntax error.
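
The difference between the two entry points can be checked directly. This is a small sketch using only standard built-ins; `new Function` is used here just to compile the literal form at runtime so the resulting error can be caught.

```javascript
const pattern = "((([a]+)?/?>?)?)"

// As a constructor argument, the unescaped / is an ordinary pattern character.
const viaConstructor = new RegExp(pattern)
const matches = viaConstructor.test("aa/>")

// As a literal, the first unescaped / outside a class ends the regexp body,
// which leaves the unbalanced pattern ((([a]+)? and makes the source invalid.
let literalError = null
try {
  new Function("return /" + pattern + "/")
} catch (e) {
  literalError = e
}

console.log(matches)                             // true
console.log(literalError instanceof SyntaxError) // true
```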

@jviereck

Three test cases test something like ((([a]+)?/?>?)?). I don't understand why that should pass. @mathiasbynens, could you give me a hint?

When I implemented regexp.js (where regjsparse.js emerged from) and I imported the test262 tests, I recall there were some RegExps matching XML that were failing. It looked like these tests in test262 should fail from reading the spec, but they did not and also browsers were happy. So, I recall adjusting the parser to make things work.

@mysticatea
Contributor Author

@marijnh Do you have a plan to publish a new version?

@marijnh
Member

marijnh commented Feb 26, 2018

Do you have time to look into the performance regression reported here first? That seems a lot for just parsing regexps, which should be only a small fraction of the source code by size—maybe if you profile something obvious will stand out.

@ghost

ghost commented Feb 26, 2018

@marijnh I did some research regarding the performance loss and found that the performance loss only exists for libraries that rely on heavy use of regular expressions such as jQuery and Angular. There is no perf loss for libs that use only a small number of regexps, and parsing a single regexp or two does not cause any perf loss either. Hope that helps :)

I think the bottleneck may be where you create a new regexp prototype instance when validating regular expressions. Maybe try to cache this part somehow. Invoking the new keyword frequently can and will create perf loss. V8, as far as I know, optimizes for this differently.

@marijnh
Member

marijnh commented Feb 26, 2018

the performance loss only exists for libraries that rely on heavy use of regular expressions such as jQuery and Angular

Looking through the source code for jquery I don't see all that many regular expressions, so I don't understand why validating them would produce such a noticeable slowdown.

I think the bottleneck may be where you create a new regexp prototype instance when validating regular expressions

This was done before the recent patches as well, and is required by the ESTree spec, so that's not the source of the regression and not something we can change.

@ghost

ghost commented Feb 26, 2018

@marijnh I looked into the regexp validation code. Not sure how to reproduce it, but you can use console.log. It looks like the code for indexing the chars runs multiple times even when it shouldn't, and a "length check" is used to return when there is nothing to be done. I noticed the same code running in several places even for a single regexp.

I also noticed that, at the top of the script, it reparses some regular expressions. As far as I understand, this code was modeled after V8, so I did a comparison, and there is no reparsing there. Not sure if this is enough to cause a perf regression.

It also seems (not 100% sure) that there is some duplicated validation code when parsing Unicode, e.g. surrogate pairs.

@mysticatea
Contributor Author

In my env (Node 8.9.3, Windows 10), the performance impact looks small.

"use strict"

const fs = require("fs")
const {performance} = require("perf_hooks")
const {parse} = require("./dist/acorn")
const code = fs.readFileSync("jquery-3.3.1.js", "utf8")

const times = []
for (let i = 0; i < 0x2FF; ++i) {
  const t0 = performance.now()
  parse(code, {ecmaVersion: 8})
  const t1 = performance.now()

  process.stdout.write(".")
  times.push(t1 - t0)
}
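// Sort numerically; the default sort() compares elements as strings.
times.sort((a, b) => a - b)
// Prints the middle element of the sorted timings (a median, labeled MEAN).
console.log("\nMEAN:", times[times.length / 2 | 0], "ms")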

Before: MEAN: 20.827545000240207 ms (ba939ea)
After: MEAN: 21.117823000997305 ms (0d20f67) (+1.4%)
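
One detail worth noting about the script above: the value it prints under the MEAN label is the middle element of the sorted timings, which is a median. A minimal sketch of both statistics follows, with hypothetical helper names `median` and `mean` and made-up sample values. It also shows why a numeric comparator matters, since `Array.prototype.sort` without one compares elements as strings.

```javascript
// Median of a list of numbers (numeric sort on a copy).
function median(values) {
  const sorted = values.slice().sort((a, b) => a - b)
  const mid = sorted.length >> 1
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2
}

// Arithmetic mean of a list of numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length
}

// Illustrative timings in ms, not measurements from this thread.
const sample = [9.5, 100.25, 20.75]

// Default sort compares "9.5", "100.25", "20.75" as strings.
const lexicographic = sample.slice().sort()

console.log(lexicographic)  // [ 100.25, 20.75, 9.5 ]
console.log(median(sample)) // 20.75
console.log(mean(sample))   // 43.5
```

The median is less sensitive to a few slow outlier runs (GC pauses, JIT warm-up), which is a reasonable choice for this kind of micro-benchmark, as long as the label says so.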

@ghost

ghost commented Feb 27, 2018

@mysticatea you are only testing jquery. Run acorn's own benchmark and you will notice it. Also try benchmarking libraries such as angular 1.6, jquery mobile, ts etc. The larger the libs are and the more regular expressions they contain, the bigger the perf drop compared to the current acorn version.

Note that the acorn benchmarks run in the browser, but I also tested against nodejs 4.x, 8.x and latest.

@mysticatea
Contributor Author

mysticatea commented Feb 27, 2018

I also looked at node --prof profiling, but I couldn't find evidence that the methods of this validator stand out: https://gist.github.com/mysticatea/57e34ba9be9b7a7676b583febf25d309

@jonnymanok Where is the benchmark script? I'm looking for npm run bench or something like that.

@mysticatea
Contributor Author

I tried with jquery and jquery-mobile:

Before: MEAN: 80.36216400004923 ms (ba939ea)
After: MEAN: 80.5539809986949 ms (0d20f67) (+0.2%)

"use strict"

const fs = require("fs")
const {performance} = require("perf_hooks")
const {parse} = require("../dist/acorn")
const code = [
  fs.readFileSync("jquery-3.3.1.js", "utf8"),
  fs.readFileSync("jquery.mobile-git.js", "utf8")
].join("\n")

const times = []
for (let i = 0; i < 0x3FF; ++i) {
  const t0 = performance.now()
  parse(code, {ecmaVersion: 8})
  const t1 = performance.now()

  process.stdout.write(".")
  times.push(t1 - t0)
}
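// Sort numerically; the default sort() compares elements as strings.
times.sort((a, b) => a - b)
// Prints the middle element of the sorted timings (a median, labeled MEAN).
console.log("\nMEAN:", times[times.length / 2 | 0], "ms")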

@ghost

ghost commented Feb 27, 2018

@mysticatea here

@mysticatea
Contributor Author

mysticatea commented Feb 27, 2018

@jonnymanok Thanks. But how can I run it locally with a specific commit?

@ghost

ghost commented Feb 27, 2018

@mysticatea Download the source from this repo, compile it, and modify the benchmark source to run your local copy.

Here is my current result on my computer. Please note that the results differ from computer to computer. I only ran the default libs that already exist in the benchmark; I didn't add any others.

React.js is the worst candidate in this library collection :)

new_bench

@mysticatea
Contributor Author

Ah, I found it in the test/bench directory. Thanks.

@ghost

ghost commented Feb 27, 2018

NP. And here is a screenshot from another computer I ran this on. I compared Acorn dev against both Esprima and the cherow parser. I still notice a perf regression in both the jquery and react libraries.
bench1

@mysticatea
Contributor Author

I tried it with all commits between the current master and 5.4.1.

image

@mysticatea
Contributor Author

mysticatea commented Feb 27, 2018

image

@mysticatea
Contributor Author

image

@mysticatea
Contributor Author

I didn't find a performance regression... 🤔

@mysticatea
Contributor Author

In my env, the dev seems faster than 5.4.1.
Mysterious.

image

@ghost

ghost commented Feb 27, 2018

I ran out of time on my end, but I see you didn't find much of a regression. Try adding Angular 1.6, typescript and jquery mobile. Also note that the results differ from computer to computer.
I ran this on an i5 7th gen laptop with Windows 10, and an i3 4th gen desktop with Windows Vista.

@KFlash

KFlash commented Feb 27, 2018

As far as I can see, nothing obvious in the current code causes a huge perf loss. There's room for optimization and performance tweaks, but that's it. I tested this against jslint. Line 423 contains a very long regexp you can validate against.

@marijnh
Member

marijnh commented Feb 27, 2018

Okay, that sounds like there's no real problem. I've released 5.5.0.

@jdalton
Contributor

jdalton commented Feb 28, 2018

Is there a way to disable this and defer to the given engine?

Update:

Maybe nooping pp.validateRegExpFlags and pp.validateRegExpPattern?
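
A sketch of what that suggestion could look like, under loudly stated assumptions: `StandInParser` and `parseRegExp` are hypothetical stand-ins written for this example and are not acorn's code. Only the two method names, `validateRegExpFlags` and `validateRegExpPattern`, come from the comment above, and their real signatures may differ. The idea is that overriding both with no-ops leaves the engine's own `RegExp` constructor as the only check.

```javascript
// Hypothetical stand-in for a parser with built-in regexp validation.
class StandInParser {
  // Toy validator that only knows the g, i and m flags.
  validateRegExpFlags(state) {
    if (/[^gim]/.test(state.flags)) throw new SyntaxError("Invalid regular expression flag")
  }
  // Toy validator that rejects an unbalanced opening paren.
  validateRegExpPattern(state) {
    if (state.pattern === "(") throw new SyntaxError("Invalid regular expression: unterminated group")
  }
  parseRegExp(pattern, flags) {
    const state = {pattern, flags}
    this.validateRegExpFlags(state)
    this.validateRegExpPattern(state)
    // Defer to the host engine; null means the engine rejected it.
    try { return new RegExp(pattern, flags) } catch (e) { return null }
  }
}

// No-op both validators so that only the engine decides.
const noRegExpValidation = Base => class extends Base {
  validateRegExpFlags() {}
  validateRegExpPattern() {}
}

const Lenient = noRegExpValidation(StandInParser)
const accepted = new Lenient().parseRegExp("a", "u") // engine accepts the u flag
const rejected = new Lenient().parseRegExp("(", "")  // engine rejects this pattern
```

The trade-off is that the result then depends on the engine running the parser, so the same source may be accepted on one runtime and rejected on another.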

7 participants