Slower search with combination word boundary and multiple regex on certain files #1860
One of the things the lazy DFA can't handle is Unicode word boundaries, since it requires multi-byte look-around. However, it turns out that on pure ASCII text, Unicode word boundaries are equivalent to ASCII word boundaries. So the DFA has a heuristic: it treats Unicode word boundaries as ASCII boundaries until it sees a non-ASCII byte. When it does, it quits, and some other (slower) regex engine needs to take over.

In a bug report against ripgrep [1], it was discovered that the lazy DFA was quitting and falling back to a slower engine even though the haystack was pure ASCII. It turned out that our equivalence byte class optimization was at fault. Namely, a '{' (which appears very frequently in the input) was being grouped in with other non-ASCII bytes. So whenever the DFA saw it, it treated it as a non-ASCII byte and thus stopped.

The fix for this is simple: when we see a Unicode word boundary in the compiler, we set a boundary on our byte classes such that ASCII bytes are guaranteed to be in a different class from non-ASCII bytes. And indeed, this fixes the performance problem reported in [1].

[1] - BurntSushi/ripgrep#1860
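To illustrate the equivalence the heuristic relies on, here is a minimal sketch using the `regex` crate; the pattern and haystack are invented for illustration. On pure-ASCII input, the default Unicode `\b` and the ASCII-only `(?-u)\b` cannot disagree:

```rust
// Requires the `regex` crate as a dependency.
use regex::bytes::Regex;

fn main() {
    // Unicode word boundary (the default) vs. ASCII word boundary.
    let unicode_wb = Regex::new(r"\bfoo\b").unwrap();
    let ascii_wb = Regex::new(r"(?-u)\bfoo\b").unwrap();

    // A pure-ASCII haystack: the two semantics cannot disagree here,
    // which is what lets the lazy DFA treat `\b` as `(?-u)\b` until it
    // sees a non-ASCII byte.
    let haystack = b"foo {foo} foobar";
    let u: Vec<_> = unicode_wb.find_iter(haystack).map(|m| m.range()).collect();
    let a: Vec<_> = ascii_wb.find_iter(haystack).map(|m| m.range()).collect();
    assert_eq!(u, a); // identical match positions on ASCII-only text
}
```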
This is the definition of a perfect bug report. Thank you so much. And thank you for reporting it. This was indeed quite a subtle performance bug in the regex engine. The actual fix is here: rust-lang/regex#768

The higher level view is that there is a space (and also time) saving optimization in the faster lazy DFA engine that groups bytes into equivalence classes when they can't otherwise discriminate between a match and a non-match. That grouping was actually correct here, but it didn't account for bytes that cause the DFA to quit early. In particular, your JSON file has many `{` bytes, and `{` was being grouped into the same class as the non-ASCII bytes that make the DFA quit.

The fix for this was to simply mark the ASCII range as distinct from the non-ASCII range when constructing the equivalence class boundaries (if a Unicode word boundary is seen).

Another interesting way to look at this is if you try adding the …

Thanks again for the great bug report!
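To make the quit-byte interaction described above concrete, here is a toy model of byte equivalence classes; it illustrates the idea and is not the actual regex-automata data structure. A set of boundary bytes partitions 0..=255 into classes, and the fix corresponds to always placing a boundary at 0x80 when a Unicode word boundary is compiled:

```rust
/// Toy model of DFA byte equivalence classes: a set of boundary bytes
/// partitions 0..=255, and the DFA treats all bytes within a class
/// identically. An illustration only, not the real representation.
fn classes(boundaries: &[u8]) -> [u8; 256] {
    let mut class = [0u8; 256];
    let mut id = 0u8;
    for b in 0u8..=255 {
        if b != 0 && boundaries.contains(&b) {
            id += 1; // a boundary byte starts a new class
        }
        class[b as usize] = id;
    }
    class
}

fn main() {
    // Suppose the pattern itself only distinguishes 'a'..='z'. Then '{'
    // (0x7B) falls into the same catch-all class as every non-ASCII
    // byte, so the "quit on non-ASCII" heuristic also fires on '{'.
    let before_fix = classes(&[b'a', b'z' + 1]);
    assert_eq!(before_fix[b'{' as usize], before_fix[0xCE]);

    // The fix: when a Unicode word boundary is compiled, also place a
    // boundary at 0x80 so ASCII bytes never share a class with
    // non-ASCII bytes.
    let after_fix = classes(&[b'a', b'z' + 1, 0x80]);
    assert_ne!(after_fix[b'{' as usize], after_fix[0xCE]);
}
```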
Thank you for the explanation and the quick turn-around on a fix! It sounds like there are some really neat things happening in the Rust regex library. ripgrep has become an indispensable part of my toolkit, so you have my gratitude and admiration for all your work on the project.
What version of ripgrep are you using?
How did you install ripgrep?
What operating system are you using ripgrep on?
ripgrep is inside a Docker container.
Host: Pop!_OS 20.04 (Ubuntu 20.04 variant)
Container: Ubuntu 21.04 Hirsute
Kernel:
Describe your bug.

This may be a duplicate of, or related to, #1760. I am getting much slower execution time when using word boundaries (`\b`) in a regex. The search performance is:

- fast with a single `\b` search pattern;
- slow with a `\b` search pattern plus an additional pattern;
- fast with a `\b` search pattern plus an additional pattern when `--no-unicode` is used;
- fast with a `\b` search pattern plus an additional pattern on a differently formatted file (TSV vs JSON).

What are the steps to reproduce the behavior?
Here are the two files which can be used to reproduce the issue:

- conn.json.log
- conn.tsv.log

These are Zeek conn log files that contain the same information but in different formats. conn.tsv.log is tab-separated and was converted to JSON to generate conn.json.log. The files I uploaded are the `head -1000` of two larger files. I can provide the larger logs if they would be useful.

The `time` command differences are very small with the files I uploaded, but I'll also provide hyperfine output for both the full logs and the ones I uploaded.

What is the actual behavior?
The following timings are on the larger logs.
When I use a regex for an IP address, the search time is very small (<1s). When I add an additional regex, the search time is long (>30s). If I add `--no-unicode`, the search time drops back down (<1s).

This next part puzzles me even more. Recall that the TSV file was used to generate the JSON file. I know these aren't the same and are quite different file sizes, but it's not like a bunch of non-ASCII characters got added to one file and not the other. If I run the command that caused the long search time on the JSON file against the TSV file, the search time is short (<1s).
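As a rough reconstruction of this experiment (the exact commands were lost from this page, so the IP address, the extra pattern, and the synthetic haystack below are all invented stand-ins), one can time the same three searches with the `regex` crate. On a regex release predating the fix in rust-lang/regex#768, the middle search would be dramatically slower:

```rust
use regex::bytes::Regex;
use std::time::Instant;

/// Compile `pattern` and time a full scan of `haystack`.
fn time_search(pattern: &str, haystack: &[u8]) {
    let re = Regex::new(pattern).unwrap();
    let start = Instant::now();
    let count = re.find_iter(haystack).count();
    println!("{pattern}: {count} matches in {:?}", start.elapsed());
}

fn main() {
    // A JSON-like, pure-ASCII haystack with many '{' bytes, standing in
    // for conn.json.log (invented data; id.orig_h is a real Zeek field).
    let haystack = br#"{"id.orig_h":"192.168.1.10","proto":"tcp"}"#.repeat(100_000);

    // Reported fast: a single word-bounded IP pattern.
    time_search(r"\b192\.168\.1\.10\b", &haystack);
    // Reported slow (before the fix): the same pattern plus another one.
    time_search(r"\b192\.168\.1\.10\b|tcp", &haystack);
    // Reported fast again: the same two patterns with Unicode disabled,
    // roughly what ripgrep's --no-unicode does.
    time_search(r"(?-u)\b192\.168\.1\.10\b|tcp", &haystack);
}
```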
Here are hyperfine results that compare the last three commands, along with using `--no-unicode` on the TSV file for good measure.

And here are hyperfine results with nearly the same commands on the smaller files provided, just to show that the same differences can be seen.
Here are several debug outputs.

- Fast (one search pattern):
- Slow (two search patterns):
- Fast (two search patterns, `--no-unicode`):
- Fast (two search patterns, TSV):
- Fast (two search patterns, no word boundaries):
What is the expected behavior?
Ideally, I'd like the slow search to take roughly the same time as the other fast searches.
I understand from #1760 that handling word boundaries with Unicode is inherently slow. But if that's the case here, I wonder why the same search on a different file can be so much faster?
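One way to confirm the observation above that no non-ASCII characters distinguish the two files is to check them directly; a small sketch, assuming the two attached filenames, is:

```rust
use std::fs;

// Check that both haystacks are pure ASCII, which is what makes the
// slowdown surprising: the Unicode/ASCII distinction cannot change any
// match on these files. Filenames assumed from the attachments above.
fn main() -> std::io::Result<()> {
    for path in ["conn.json.log", "conn.tsv.log"] {
        let bytes = fs::read(path)?;
        println!("{path}: pure ASCII = {}", bytes.is_ascii());
    }
    Ok(())
}
```

If both print true, Unicode semantics cannot be the distinguishing factor between the two files, which is what pointed the investigation at the byte class grouping instead.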