Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

automata: fix incorrect offsets reported by reverse inner optimization #1063

Merged
merged 1 commit into from
Aug 5, 2023

Conversation

BurntSushi
Copy link
Member

Sadly it seems that my days of squashing optimization bugs are still before me. In this particular case, the reverse inner literal optimization (which is a new optimization introduced in regex 1.9) resulted in reporting incorrect match offsets in some cases. The offending case here is:

$ regex-cli find match meta --no-table -p '(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})' -y '888:77:66'
0:1:9:888:77:66

The above reports a match at 1..9, but the correct match is 0..9. The problem here is that the reverse inner literal optimization is being applied, which splits the regex into three (conceptual) pieces:

  1. (?:(\d+)[:.])?(\d{1,2})
  2. [:.]
  3. (\d{2})

The reverse inner optimization works by looking for occurrences of (2) first, then matching (1) in reverse to find the start position of the match and then searching for (3) in the forward direction to find the end of the match.

The problem in this particular case is that (2) matches at position 3 in the 888:77:66 haystack. Since the first section of numbers is optional, the reverse inner optimization believes a match exists at offset 1 by virtue of matching (1) in reverse. That is, the (\d{1,2}) matches at 1..3 while the (?:(\d+)[:.])? doesn't match at all. The reverse search here is correct in isolation, but it leads to an overall incorrect result by stopping the search early. The issue is that the true leftmost match requires (2) to match at 6..7, but since it matched at 3..4 first, it is considered first and leads to an incorrect overall match.

To fix this, we add another "trip wire" to the reverse inner optimization (of which there are already several) that tries to detect cases where it cannot prove that the match it found is actually the leftmost match. Namely, if it reports a match offset greater than the start of the search and otherwise could have kept searching, then we don't know whether we have the true leftmost match. In that case, we bail on the optimization and let a slower path take over.

This is yet another example of how the nature of regex searching, and in particular leftmost searching, inhibits the composition of different regex strategies. Or at least, makes them incredibly subtle.

Fixes #1060

Sadly it seems that my days of squashing optimization bugs are still
before me. In this particular case, the reverse inner literal
optimization (which is a new optimization introduced in regex 1.9)
resulted in reporting incorrect match offsets in some cases. The
offending case here is:

    $ regex-cli find match meta --no-table -p '(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})' -y '888:77:66'
    0:1:9:888:77:66

The above reports a match at 1..9, but the correct match is 0..9. The
problem here is that the reverse inner literal optimization is being
applied, which splits the regex into three (conceptual) pieces:

1. `(?:(\d+)[:.])?(\d{1,2})`
2. `[:.]`
3. `(\d{2})`

The reverse inner optimization works by looking for occurrences of (2)
first, then matching (1) in reverse to find the start position of the
match and then searching for (3) in the forward direction to find the
end of the match.

The problem in this particular case is that (2) matches at position `3`
in the `888:77:66` haystack. Since the first section of numbers is
optional, the reverse inner optimization believes a match exists at
offset `1` by virtue of matching (1) in reverse. That is, the
`(\d{1,2})` matches at 1..3 while the `(?:(\d+)[:.])?` doesn't match at
all. The reverse search here is correct in isolation, but it leads to an
overall incorrect result by stopping the search early. The issue is that
the true leftmost match requires (2) to match at 6..7, but since it
matched at 3..4 first, it is considered first and leads to an incorrect
overall match.

To fix this, we add another "trip wire" to the reverse inner
optimization (of which there are already several) that tries to detect
cases where it cannot prove that the match it found is actually the
leftmost match. Namely, if it reports a match offset greater than the
start of the search and otherwise *could* have kept searching, then we
don't know whether we have the true leftmost match. In that case, we
bail on the optimization and let a slower path take over.

This is yet another example of how the nature of regex searching, and in
particular leftmost searching, inhibits the composition of different
regex strategies. Or at least, makes them incredibly subtle.

Fixes #1060
@BurntSushi
Copy link
Member Author

I ran this under rebar's curated benchmark suite, and there were no obvious regressions:

benchmark                                       engine      tmp/regex1.9.2_curated.csv  tmp/regexpr_curated.csv
---------                                       ------      --------------------------  -----------------------
curated/01-literal/sherlock-en                  rust/regex  34.1 GB/s (1.00x)           33.4 GB/s (1.02x)
curated/01-literal/sherlock-casei-en            rust/regex  11.4 GB/s (1.01x)           11.6 GB/s (1.00x)
curated/01-literal/sherlock-ru                  rust/regex  32.8 GB/s (1.01x)           33.1 GB/s (1.00x)
curated/01-literal/sherlock-casei-ru            rust/regex  9.0 GB/s (1.00x)            8.4 GB/s (1.07x)
curated/01-literal/sherlock-zh                  rust/regex  38.1 GB/s (1.02x)           38.7 GB/s (1.00x)
curated/02-literal-alternate/sherlock-en        rust/regex  11.7 GB/s (1.09x)           12.8 GB/s (1.00x)
curated/02-literal-alternate/sherlock-casei-en  rust/regex  3.0 GB/s (1.00x)            3.0 GB/s (1.00x)
curated/02-literal-alternate/sherlock-ru        rust/regex  6.6 GB/s (1.01x)            6.7 GB/s (1.00x)
curated/02-literal-alternate/sherlock-casei-ru  rust/regex  1544.1 MB/s (1.00x)         1544.1 MB/s (1.00x)
curated/02-literal-alternate/sherlock-zh        rust/regex  15.2 GB/s (1.00x)           15.1 GB/s (1.00x)
curated/03-date/ascii                           rust/regex  163.3 MB/s (1.01x)          164.3 MB/s (1.00x)
curated/03-date/unicode                         rust/regex  162.2 MB/s (1.01x)          163.3 MB/s (1.00x)
curated/03-date/compile-ascii                   rust/regex  1.57ms (1.03x)              1.52ms (1.00x)
curated/03-date/compile-unicode                 rust/regex  5.47ms (1.00x)              5.47ms (1.00x)
curated/04-ruff-noqa/real                       rust/regex  1667.1 MB/s (1.01x)         1676.1 MB/s (1.00x)
curated/04-ruff-noqa/tweaked                    rust/regex  1577.2 MB/s (1.00x)         1560.6 MB/s (1.01x)
curated/04-ruff-noqa/compile-real               rust/regex  55.44us (1.00x)             56.52us (1.02x)
curated/05-lexer-veryl/single                   rust/regex  9.1 MB/s (1.00x)            9.1 MB/s (1.00x)
curated/05-lexer-veryl/compile-single           rust/regex  276.91us (1.00x)            280.81us (1.01x)
curated/05-lexer-veryl/multi                    rust/regex  95.7 MB/s (1.00x)           95.1 MB/s (1.01x)
curated/06-cloud-flare-redos/original           rust/regex  583.1 MB/s (1.00x)          579.8 MB/s (1.01x)
curated/06-cloud-flare-redos/simplified-short   rust/regex  1835.4 MB/s (1.13x)         2.0 GB/s (1.00x)
curated/06-cloud-flare-redos/simplified-long    rust/regex  83.9 GB/s (1.06x)           88.7 GB/s (1.00x)
curated/07-unicode-character-data/parse-line    rust/regex  355.8 MB/s (1.00x)          356.5 MB/s (1.00x)
curated/07-unicode-character-data/compile       rust/regex  28.44us (1.00x)             28.66us (1.01x)
curated/08-words/all-english                    rust/regex  119.3 MB/s (1.00x)          119.1 MB/s (1.00x)
curated/08-words/all-russian                    rust/regex  19.4 MB/s (1.00x)           19.4 MB/s (1.00x)
curated/08-words/long-english                   rust/regex  802.4 MB/s (1.00x)          802.5 MB/s (1.00x)
curated/08-words/long-russian                   rust/regex  35.1 MB/s (1.00x)           35.1 MB/s (1.00x)
curated/09-aws-keys/full                        rust/regex  1783.1 MB/s (1.00x)         1784.1 MB/s (1.00x)
curated/09-aws-keys/quick                       rust/regex  1872.5 MB/s (1.00x)         1868.0 MB/s (1.00x)
curated/09-aws-keys/compile-full                rust/regex  84.29us (1.00x)             84.96us (1.01x)
curated/09-aws-keys/compile-quick               rust/regex  14.81us (1.00x)             15.78us (1.07x)
curated/10-bounded-repeat/letters-en            rust/regex  725.4 MB/s (1.00x)          720.5 MB/s (1.01x)
curated/10-bounded-repeat/letters-ru            rust/regex  661.0 MB/s (1.00x)          663.9 MB/s (1.00x)
curated/10-bounded-repeat/context               rust/regex  95.5 MB/s (1.02x)           97.2 MB/s (1.00x)
curated/10-bounded-repeat/capitals              rust/regex  824.6 MB/s (1.00x)          825.6 MB/s (1.00x)
curated/10-bounded-repeat/compile-context       rust/regex  59.59us (1.00x)             59.46us (1.00x)
curated/10-bounded-repeat/compile-capitals      rust/regex  59.31us (1.00x)             59.22us (1.00x)
curated/11-unstructured-to-json/extract         rust/regex  112.4 MB/s (1.00x)          112.3 MB/s (1.00x)
curated/11-unstructured-to-json/compile         rust/regex  19.87us (1.00x)             20.25us (1.02x)
curated/12-dictionary/single                    rust/regex  718.1 MB/s (1.00x)          714.8 MB/s (1.00x)
curated/12-dictionary/multi                     rust/regex  173.0 MB/s (1.00x)          173.2 MB/s (1.00x)
curated/12-dictionary/compile-single            rust/regex  7.68ms (1.00x)              7.81ms (1.02x)
curated/12-dictionary/compile-multi             rust/regex  15.91ms (1.00x)             15.89ms (1.00x)
curated/13-noseyparker/single                   rust/regex  126.4 MB/s (1.01x)          127.2 MB/s (1.00x)
curated/13-noseyparker/multi                    rust/regex  100.5 MB/s (1.00x)          100.3 MB/s (1.00x)
curated/13-noseyparker/compile-single           rust/regex  2.43ms (1.00x)              2.43ms (1.00x)
curated/13-noseyparker/compile-multi            rust/regex  2.84ms (1.01x)              2.80ms (1.00x)
curated/14-quadratic/1x                         rust/regex  18.0 MB/s (1.00x)           18.0 MB/s (1.00x)
curated/14-quadratic/2x                         rust/regex  8.6 MB/s (1.00x)            8.6 MB/s (1.00x)
curated/14-quadratic/10x                        rust/regex  1708.2 KB/s (1.00x)         1708.4 KB/s (1.00x)

@BurntSushi BurntSushi merged commit 73f7889 into master Aug 5, 2023
@BurntSushi BurntSushi deleted the ag/fix-1060 branch August 5, 2023 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

reverse inner literal optimization results in incorrect match offsets in some cases
1 participant