automata: fix incorrect offsets reported by reverse inner optimization #1063

BurntSushi · 2023-08-05T21:38:26Z

Sadly it seems that my days of squashing optimization bugs are still before me. In this particular case, the reverse inner literal optimization (which is a new optimization introduced in regex 1.9) resulted in reporting incorrect match offsets in some cases. The offending case here is:

$ regex-cli find match meta --no-table -p '(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})' -y '888:77:66'
0:1:9:888:77:66

The above reports a match at 1..9, but the correct match is 0..9. The problem here is that the reverse inner literal optimization is being applied, which splits the regex into three (conceptual) pieces:

(?:(\d+)[:.])?(\d{1,2})
[:.]
(\d{2})

The reverse inner optimization works by looking for occurrences of (2) first, then matching (1) in reverse to find the start position of the match and then searching for (3) in the forward direction to find the end of the match.

The problem in this particular case is that (2) matches at position 3 in the 888:77:66 haystack. Since the first section of numbers is optional, the reverse inner optimization believes a match exists at offset 1 by virtue of matching (1) in reverse. That is, the (\d{1,2}) matches at 1..3 while the (?:(\d+)[:.])? doesn't match at all. The reverse search here is correct in isolation, but it leads to an overall incorrect result by stopping the search early. The issue is that the true leftmost match requires (2) to match at 6..7, but since it matched at 3..4 first, it is considered first and leads to an incorrect overall match.

To fix this, we add another "trip wire" to the reverse inner optimization (of which there are already several) that tries to detect cases where it cannot prove that the match it found is actually the leftmost match. Namely, if it reports a match offset greater than the start of the search and otherwise could have kept searching, then we don't know whether we have the true leftmost match. In that case, we bail on the optimization and let a slower path take over.

This is yet another example of how the nature of regex searching, and in particular leftmost searching, inhibits the composition of different regex strategies. Or at least, makes them incredibly subtle.

Fixes #1060

Sadly it seems that my days of squashing optimization bugs are still before me. In this particular case, the reverse inner literal optimization (which is a new optimization introduced in regex 1.9) resulted in reporting incorrect match offsets in some cases. The offending case here is: $ regex-cli find match meta --no-table -p '(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})' -y '888:77:66' 0:1:9:888:77:66 The above reports a match at 1..9, but the correct match is 0..9. The problem here is that the reverse inner literal optimization is being applied, which splits the regex into three (conceptual) pieces: 1. `(?:(\d+)[:.])?(\d{1,2})` 2. `[:.]` 3. `(\d{2})` The reverse inner optimization works by looking for occurrences of (2) first, then matching (1) in reverse to find the start position of the match and then searching for (3) in the forward direction to find the end of the match. The problem in this particular case is that (2) matches at position `3` in the `888:77:66` haystack. Since the first section of numbers is optional, the reverse inner optimization believes a match exists at offset `1` by virtue of matching (1) in reverse. That is, the `(\d{1,2})` matches at 1..3 while the `(?:(\d+)[:.])?` doesn't match at all. The reverse search here is correct in isolation, but it leads to an overall incorrect result by stopping the search early. The issue is that the true leftmost match requires (2) to match at 6..7, but since it matched at 3..4 first, it is considered first and leads to an incorrect overall match. To fix this, we add another "trip wire" to the reverse inner optimization (of which there are already several) that tries to detect cases where it cannot prove that the match it found is actually the leftmost match. Namely, if it reports a match offset greater than the start of the search and otherwise *could* have kept searching, then we don't know whether we have the true leftmost match. In that case, we bail on the optimization and let a slower path take over. This is yet another example of how the nature of regex searching, and in particular leftmost searching, inhibits the composition of different regex strategies. Or at least, makes them incredibly subtle. Fixes #1060

BurntSushi · 2023-08-05T22:16:01Z

I ran this under rebar's curated benchmark suite, and there were no obvious regressions:

benchmark                                       engine      tmp/regex1.9.2_curated.csv  tmp/regexpr_curated.csv
---------                                       ------      --------------------------  -----------------------
curated/01-literal/sherlock-en                  rust/regex  34.1 GB/s (1.00x)           33.4 GB/s (1.02x)
curated/01-literal/sherlock-casei-en            rust/regex  11.4 GB/s (1.01x)           11.6 GB/s (1.00x)
curated/01-literal/sherlock-ru                  rust/regex  32.8 GB/s (1.01x)           33.1 GB/s (1.00x)
curated/01-literal/sherlock-casei-ru            rust/regex  9.0 GB/s (1.00x)            8.4 GB/s (1.07x)
curated/01-literal/sherlock-zh                  rust/regex  38.1 GB/s (1.02x)           38.7 GB/s (1.00x)
curated/02-literal-alternate/sherlock-en        rust/regex  11.7 GB/s (1.09x)           12.8 GB/s (1.00x)
curated/02-literal-alternate/sherlock-casei-en  rust/regex  3.0 GB/s (1.00x)            3.0 GB/s (1.00x)
curated/02-literal-alternate/sherlock-ru        rust/regex  6.6 GB/s (1.01x)            6.7 GB/s (1.00x)
curated/02-literal-alternate/sherlock-casei-ru  rust/regex  1544.1 MB/s (1.00x)         1544.1 MB/s (1.00x)
curated/02-literal-alternate/sherlock-zh        rust/regex  15.2 GB/s (1.00x)           15.1 GB/s (1.00x)
curated/03-date/ascii                           rust/regex  163.3 MB/s (1.01x)          164.3 MB/s (1.00x)
curated/03-date/unicode                         rust/regex  162.2 MB/s (1.01x)          163.3 MB/s (1.00x)
curated/03-date/compile-ascii                   rust/regex  1.57ms (1.03x)              1.52ms (1.00x)
curated/03-date/compile-unicode                 rust/regex  5.47ms (1.00x)              5.47ms (1.00x)
curated/04-ruff-noqa/real                       rust/regex  1667.1 MB/s (1.01x)         1676.1 MB/s (1.00x)
curated/04-ruff-noqa/tweaked                    rust/regex  1577.2 MB/s (1.00x)         1560.6 MB/s (1.01x)
curated/04-ruff-noqa/compile-real               rust/regex  55.44us (1.00x)             56.52us (1.02x)
curated/05-lexer-veryl/single                   rust/regex  9.1 MB/s (1.00x)            9.1 MB/s (1.00x)
curated/05-lexer-veryl/compile-single           rust/regex  276.91us (1.00x)            280.81us (1.01x)
curated/05-lexer-veryl/multi                    rust/regex  95.7 MB/s (1.00x)           95.1 MB/s (1.01x)
curated/06-cloud-flare-redos/original           rust/regex  583.1 MB/s (1.00x)          579.8 MB/s (1.01x)
curated/06-cloud-flare-redos/simplified-short   rust/regex  1835.4 MB/s (1.13x)         2.0 GB/s (1.00x)
curated/06-cloud-flare-redos/simplified-long    rust/regex  83.9 GB/s (1.06x)           88.7 GB/s (1.00x)
curated/07-unicode-character-data/parse-line    rust/regex  355.8 MB/s (1.00x)          356.5 MB/s (1.00x)
curated/07-unicode-character-data/compile       rust/regex  28.44us (1.00x)             28.66us (1.01x)
curated/08-words/all-english                    rust/regex  119.3 MB/s (1.00x)          119.1 MB/s (1.00x)
curated/08-words/all-russian                    rust/regex  19.4 MB/s (1.00x)           19.4 MB/s (1.00x)
curated/08-words/long-english                   rust/regex  802.4 MB/s (1.00x)          802.5 MB/s (1.00x)
curated/08-words/long-russian                   rust/regex  35.1 MB/s (1.00x)           35.1 MB/s (1.00x)
curated/09-aws-keys/full                        rust/regex  1783.1 MB/s (1.00x)         1784.1 MB/s (1.00x)
curated/09-aws-keys/quick                       rust/regex  1872.5 MB/s (1.00x)         1868.0 MB/s (1.00x)
curated/09-aws-keys/compile-full                rust/regex  84.29us (1.00x)             84.96us (1.01x)
curated/09-aws-keys/compile-quick               rust/regex  14.81us (1.00x)             15.78us (1.07x)
curated/10-bounded-repeat/letters-en            rust/regex  725.4 MB/s (1.00x)          720.5 MB/s (1.01x)
curated/10-bounded-repeat/letters-ru            rust/regex  661.0 MB/s (1.00x)          663.9 MB/s (1.00x)
curated/10-bounded-repeat/context               rust/regex  95.5 MB/s (1.02x)           97.2 MB/s (1.00x)
curated/10-bounded-repeat/capitals              rust/regex  824.6 MB/s (1.00x)          825.6 MB/s (1.00x)
curated/10-bounded-repeat/compile-context       rust/regex  59.59us (1.00x)             59.46us (1.00x)
curated/10-bounded-repeat/compile-capitals      rust/regex  59.31us (1.00x)             59.22us (1.00x)
curated/11-unstructured-to-json/extract         rust/regex  112.4 MB/s (1.00x)          112.3 MB/s (1.00x)
curated/11-unstructured-to-json/compile         rust/regex  19.87us (1.00x)             20.25us (1.02x)
curated/12-dictionary/single                    rust/regex  718.1 MB/s (1.00x)          714.8 MB/s (1.00x)
curated/12-dictionary/multi                     rust/regex  173.0 MB/s (1.00x)          173.2 MB/s (1.00x)
curated/12-dictionary/compile-single            rust/regex  7.68ms (1.00x)              7.81ms (1.02x)
curated/12-dictionary/compile-multi             rust/regex  15.91ms (1.00x)             15.89ms (1.00x)
curated/13-noseyparker/single                   rust/regex  126.4 MB/s (1.01x)          127.2 MB/s (1.00x)
curated/13-noseyparker/multi                    rust/regex  100.5 MB/s (1.00x)          100.3 MB/s (1.00x)
curated/13-noseyparker/compile-single           rust/regex  2.43ms (1.00x)              2.43ms (1.00x)
curated/13-noseyparker/compile-multi            rust/regex  2.84ms (1.01x)              2.80ms (1.00x)
curated/14-quadratic/1x                         rust/regex  18.0 MB/s (1.00x)           18.0 MB/s (1.00x)
curated/14-quadratic/2x                         rust/regex  8.6 MB/s (1.00x)            8.6 MB/s (1.00x)
curated/14-quadratic/10x                        rust/regex  1708.2 KB/s (1.00x)         1708.4 KB/s (1.00x)

BurntSushi merged commit 73f7889 into master Aug 5, 2023

BurntSushi deleted the ag/fix-1060 branch August 5, 2023 22:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

automata: fix incorrect offsets reported by reverse inner optimization #1063

automata: fix incorrect offsets reported by reverse inner optimization #1063

BurntSushi commented Aug 5, 2023

BurntSushi commented Aug 5, 2023

automata: fix incorrect offsets reported by reverse inner optimization #1063

automata: fix incorrect offsets reported by reverse inner optimization #1063

Conversation

BurntSushi commented Aug 5, 2023

BurntSushi commented Aug 5, 2023