-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ES2018 RegExp enhancements #673
Conversation
This is great. One thing I'm a bit wary of is the code weight. This is written in a very, err, software-engineery way. The RegExp grammar isn't all that complicated, and I figure if we cut down some of the abstraction (numeric literals rather than constants, etc) this could be at least two times smaller. If you want i can take a stab at that. Also, it may, though it makes little sense from a functionality perspective, be a good idea to make these parser methods instead of plain functions, so that plugins can add RegeExp features. |
Thank you. I agree, it's not slim. There are two reasons.
I will move the functions into the validator. |
https://github.com/jviereck/regjsparser and its tests might be a useful resource for this. |
unicode → this.switchU namedGroups → this.switchN This makes easy to enhance the validator by plugins
Plugins can not extend this class, though (I mean, I guess they sort of can, but not in a composable way). I was proposing for the methods that might be overridden to live on the |
I see. But I'm afraid it since the validator has a ton of state and methods to validate RegExp patterns. I'm not sure if the merging is better... For enhancement, is |
# Conflicts: # bin/run_test262.js
Not really. You should still keep the regexp-related state in a separate object, but put the main parsing functions on |
OK, I will try it. |
I moved the ton of methods into |
src/regexp.js
Outdated
} | ||
|
||
// Node.js 0.12/0.10 don't support String.prototype.codePointAt(). | ||
codePointAt(i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any case where it actually matters whether we have a single (non-syntax) character or two of them? Could we make do by just looking at code units not code points?
(If not, there may be a problem where this gets regexps without u
flag wrong, since those should treat surrogate pairs as two characters. If yes, we could maybe drop the extra complexity around code points?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for good catch. It was a problem. I fixed it and added tests.
I noticed this code isn't bullet proof and missing some validation. You can parse |
@jonnymanok Thank you for the comment. I added a test for |
Thanks for implementing this, @mysticatea, and thanks for reviewing, @marijnh! Running the regjsparser test suite on acorn, I found four failures:
|
Re: 2, that looks like a mistake to me. cc @jviereck |
@adrianheine |
I tried to figure out what you are referring to (is there a test in regjsparse that tests for this sequence) but couldn't find/figure it out. Happy to take a look once I know what you refer to. Thanks :) |
@jviereck line 1254 in test-data.json for example @mysticatea Right, regjsparser doesn't follow Annex B, I actually discussed that in jviereck/regjsparser#90 but forgot about it again. |
About |
When I implemented regexp.js (where regjsparse.js emerged from) and I imported the test262 tests, I recall there were some RegExps matching XML that were failing. It looked like these tests in test262 should fail from reading the spec, but they did not and also browsers were happy. So, I recall adjusting the parser to make things work. |
@marijnh Do you have a plan to publish a new version? |
Do you have time to look into the performance regression reported here first? That seems a lot for just parsing regexps, which should be only a small fraction of the source code by size—maybe if you profile something obvious will stand out. |
@marijnh I did some research regarding the performance loss and found out that the performance loss is only existing for libraries that rely on heavy use of regular expressions such as jQuery and Angular. So the perf loss isn't existing for libs that uses a small amount of regexp. And also parsing out a single regexp or two does not generate any perf loss. Hope that help :) I may think the issue or the bottle neck is where you create a new regexp prototype instance when validation regular expressions. Maybe try to cache this part somehow or something. Invoking the new keyword frequently can and will create perf loss. V8 - as far as I know - optimize for this differently. |
Looking through the source code for jquery I don't see all that many regular expressions, so I don't understand why validating them would produce such a noticeable slowdown.
This was done before the recent patches as well, and is required by the ESTree spec, so that's not the source of the regression and not something we can change. |
@marijnh I looked into the regexp validation code. Not sure how to reproduce it, but you can use I also noticed in top of the script it reparse for some regular expressions. As far as I understand this code was modeled after V8, so I did a comparison and there is no reparsing there. Not sure if this is enough to do a perf regression. It also seems to - (not 100% sure) that there are a few "code validation duplication" when parsing out unicode e.g. surrogate pairs. |
In my env (Node 8.9.3, Windows 10), performance impact looks small. "use strict"
const fs = require("fs")
const {performance} = require("perf_hooks")
const {parse} = require("./dist/acorn")
const code = fs.readFileSync("jquery-3.3.1.js", "utf8")
const times = []
for (let i = 0; i < 0x2FF; ++i) {
const t0 = performance.now()
parse(code, {ecmaVersion: 8})
const t1 = performance.now()
process.stdout.write(".")
times.push(t1 - t0)
}
times.sort()
console.log("\nMEAN:", times[times.length / 2 | 0], "ms") Before: |
@mysticatea you are only testing jquery. Run acorn own benchmark and you will notice it. Also try benchmarking librares such as angular 1.6, jquery mobile, ts etc. Larger the libs are and more regular expressions, the perf drop compared to current acorn version. Note that acorn benchmark are run in the browser, but i also tested it against nodejs 4.x, 8.x and latest. |
Also I look @jonnymanok Where is the benchmark script? I'm looking for |
I tried with jquery and jquery-mobile: Before: "use strict"
const fs = require("fs")
const {performance} = require("perf_hooks")
const {parse} = require("../dist/acorn")
const code = [
fs.readFileSync("jquery-3.3.1.js", "utf8"),
fs.readFileSync("jquery.mobile-git.js", "utf8")
].join("\n")
const times = []
for (let i = 0; i < 0x3FF; ++i) {
const t0 = performance.now()
parse(code, {ecmaVersion: 8})
const t1 = performance.now()
process.stdout.write(".")
times.push(t1 - t0)
}
times.sort()
console.log("\nMEAN:", times[times.length / 2 | 0], "ms") |
@jonnymanok Thanks. But how can I run it in my local with specific commit? |
@mysticatea Compile it, and modify the source to run your locale copy. After DL the source from this repo. Here is my current result on my computer. Please note that the results are different from computer to computer. And I ran default libs already existing in the benchmark now. I didn't add any others. The React.js is the worse candidate in this library collection :) |
Ah, I find it in |
I didn't find performance regression... 🤔 |
I ran out of time in my end, but I notice you didn't find much regression. Try to add Angular 1.6, typescript and jquery mobile. And also note that the results differ from computer to computer. |
Nothing obvious in current code that cause a huge perf loss as I could see. It's room for optimization and performance tweaks, but that's it. I tested this against jslint. Line 423 contain a very long regexp you can validate against. |
Okay, that sounds like there's no real problem. I've released 5.5.0 |
Is there a way to disable this and defer to the given engine? Update: Maybe nooping |
RegExp
validation by own validator. The validator satisfies the full-spec regular expressions based on A.8 Regular Expressions and B.1.4 Regular Expressions Patterns. The validator generates the same error messages as Node.js nativeRegExp
implementation. (I checked the code coverage of the validator by nyc and made almost 100%)I'd like to get advice for the direction.