compiler: fix lazy DFA false quits on ASCII text
One of the things the lazy DFA can't handle is Unicode word boundaries,
since they require multi-byte look-around. However, it turns out that on
pure ASCII text, Unicode word boundaries are equivalent to ASCII word
boundaries. So the DFA has a heuristic: it treats Unicode word
boundaries as ASCII boundaries until it sees a non-ASCII byte. When it
does, it quits, and some other (slower) regex engine needs to take over.
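
The heuristic can be sketched roughly as follows. This is a minimal
illustration, not the lazy DFA's actual code: the helper name
`ascii_word_boundaries_or_quit` is hypothetical, and returning `None`
stands in for "quit and let another engine take over".

```rust
/// ASCII notion of a word byte: [0-9A-Za-z_].
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Hypothetical helper: report word boundary offsets using ASCII rules,
/// or return `None` (modeling a "quit") as soon as a non-ASCII byte is
/// seen, since ASCII rules are no longer known to be correct there.
fn ascii_word_boundaries_or_quit(haystack: &[u8]) -> Option<Vec<usize>> {
    let mut boundaries = Vec::new();
    for i in 0..=haystack.len() {
        if i < haystack.len() && !haystack[i].is_ascii() {
            return None; // non-ASCII byte: give up
        }
        let before = i > 0 && is_ascii_word_byte(haystack[i - 1]);
        let after = i < haystack.len() && is_ascii_word_byte(haystack[i]);
        if before != after {
            boundaries.push(i);
        }
    }
    Some(boundaries)
}
```

On a pure ASCII haystack such as `b"foo bar"` the sketch never quits; on
`"café".as_bytes()` it quits at the first byte of the `é`.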

In a bug report against ripgrep[1], it was discovered that the lazy DFA
was quitting and falling back to a slower engine even though the
haystack was pure ASCII.

It turned out that our equivalence byte class optimization was at fault.
Namely, a '{' (which appears very frequently in the input) was being
grouped in with non-ASCII bytes. So whenever the DFA saw it, it
treated it as a non-ASCII byte and thus quit.
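
To see how an ASCII byte can end up looking non-ASCII, here is a toy
model. It assumes byte classes are split only where ASCII "wordness"
changes between adjacent byte values, and that the DFA quits on any
class containing a non-ASCII byte. All names (`toy_byte_classes`,
`quit_classes`, `first_quit`) are hypothetical; the real classes also
depend on the rest of the pattern.

```rust
/// ASCII notion of a word byte: [0-9A-Za-z_].
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Toy byte-class map: a new class starts whenever "wordness" changes
/// between adjacent byte values. Nothing separates 0x7F from 0x80.
fn toy_byte_classes() -> [u8; 256] {
    let mut classes = [0u8; 256];
    let mut class = 0u8;
    for b in 1..=255u8 {
        if is_word_byte(b) != is_word_byte(b - 1) {
            class += 1;
        }
        classes[b as usize] = class;
    }
    classes
}

/// A class is treated as a "quit" class if it contains any non-ASCII byte.
fn quit_classes(classes: &[u8; 256]) -> Vec<bool> {
    let mut quit = vec![false; 256];
    for b in 0x80usize..=0xFF {
        quit[classes[b] as usize] = true;
    }
    quit
}

/// Offset of the first byte whose class is a quit class, if any.
fn first_quit(haystack: &[u8], classes: &[u8; 256]) -> Option<usize> {
    let quit = quit_classes(classes);
    haystack.iter().position(|&b| quit[classes[b as usize] as usize])
}
```

In this model every byte after 'z' (0x7A), including '{' (0x7B), lands in
the same class as 0x80 through 0xFF, so `first_quit(b"fn main() {}", ...)`
reports a quit at the '{' even though the haystack is pure ASCII.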

The fix for this is simple: when we see a Unicode word boundary in the
compiler, we set a boundary on our byte classes such that ASCII bytes
are guaranteed to be in a different class from non-ASCII bytes. And
indeed, this fixes the performance problem reported in [1].
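
The effect of the extra boundary can be illustrated with a simplified
stand-in for a byte-class set. `ToyByteClassSet` and
`shares_class_with_non_ascii` are hypothetical names for this sketch and
its internals are an assumption; only the method names `set_range` and
`set_word_boundary` are taken from the diff below.

```rust
/// ASCII notion of a word byte: [0-9A-Za-z_].
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Simplified stand-in for a byte-class set (not the crate's real type).
/// `boundary[b] == true` means a class ends at byte value `b`.
struct ToyByteClassSet {
    boundary: [bool; 256],
}

impl ToyByteClassSet {
    fn new() -> Self {
        ToyByteClassSet { boundary: [false; 256] }
    }

    /// Ensure bytes in `start..=end` never share a class with bytes
    /// outside that range.
    fn set_range(&mut self, start: u8, end: u8) {
        assert!(start <= end);
        if start > 0 {
            self.boundary[start as usize - 1] = true;
        }
        self.boundary[end as usize] = true;
    }

    /// Split classes wherever ASCII "wordness" changes.
    fn set_word_boundary(&mut self) {
        let mut start = 0usize;
        while start <= 255 {
            let mut end = start;
            while end < 255 && is_word_byte(end as u8 + 1) == is_word_byte(start as u8) {
                end += 1;
            }
            self.set_range(start as u8, end as u8);
            start = end + 1;
        }
    }

    /// Map every byte value to its class number.
    fn classes(&self) -> [u8; 256] {
        let mut classes = [0u8; 256];
        let mut class = 0u8;
        for b in 0..256usize {
            classes[b] = class;
            if self.boundary[b] && b < 255 {
                class += 1;
            }
        }
        classes
    }
}

/// True if byte `b` shares its class with at least one non-ASCII byte.
fn shares_class_with_non_ascii(classes: &[u8; 256], b: u8) -> bool {
    (0x80usize..=0xFF).any(|n| classes[n] == classes[b as usize])
}
```

With only `set_word_boundary()` applied, '{' shares a class with the
non-ASCII bytes in this model. Adding `set_range(0, 0x7F)` forces a class
boundary between 0x7F and 0x80, so no ASCII byte can share a class with a
non-ASCII one.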

[1] - BurntSushi/ripgrep#1860
BurntSushi committed May 1, 2021
1 parent 374c168 commit 036ce80
Showing 1 changed file with 9 additions and 0 deletions.
9 changes: 9 additions & 0 deletions src/compile.rs
@@ -318,6 +318,13 @@ impl Compiler {
                 }
                 self.compiled.has_unicode_word_boundary = true;
                 self.byte_classes.set_word_boundary();
+                // We also make sure that all ASCII bytes are in a different
+                // class from non-ASCII bytes. Otherwise, it's possible for
+                // ASCII bytes to get lumped into the same class as non-ASCII
+                // bytes. This in turn may cause the lazy DFA to falsely quit
+                // when it sees an ASCII byte that maps to a byte class with
+                // non-ASCII bytes. This ensures that never happens.
+                self.byte_classes.set_range(0, 0x7F);
                 self.c_empty_look(prog::EmptyLook::WordBoundary)
             }
             WordBoundary(hir::WordBoundary::UnicodeNegate) => {
@@ -330,6 +337,8 @@ impl Compiler {
                 }
                 self.compiled.has_unicode_word_boundary = true;
                 self.byte_classes.set_word_boundary();
+                // See comments above for why we set the ASCII range here.
+                self.byte_classes.set_range(0, 0x7F);
                 self.c_empty_look(prog::EmptyLook::NotWordBoundary)
             }
             WordBoundary(hir::WordBoundary::Ascii) => {