Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ref(pattern): Replace regex based matching #4073

Merged
merged 16 commits into from
Oct 2, 2024
Merged

ref(pattern): Replace regex based matching #4073

merged 16 commits into from
Oct 2, 2024

Conversation

Dav1dde
Copy link
Member

@Dav1dde Dav1dde commented Sep 24, 2024

Replaces the regex fallback for complex patterns with a matching strategy using the parsed tokens.

The algorithm is loosely based on the one described by Kirk J Krauss. With some modifications, for example it does not use two loops and has support for character classes as well as alternations.

The implementation beats the current one of regex_lite in the added "complex" benchmark:

complex_match/case_sensitive
                        time:   [106.17 ns 106.28 ns 106.40 ns]
                        change: [-97.290% -97.284% -97.277%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
complex_match/case_insensitive
                        time:   [858.73 ns 859.23 ns 859.82 ns]
                        change: [-80.108% -80.027% -79.948%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 23 outliers among 100 measurements (23.00%)
  1 (1.00%) high mild
  22 (22.00%) high severe

When comparing it with the regex crate, it's 2x slower for case sensitive and 8x slower for case insensitive.

There are still a lot of performance improvements to be found, some of them are annotated in code and more can probably easily found by running a profiler. Especially the case insensitive code paths are very expensive.
But this is something for future PRs.

The code is fuzzed with:

$ cargo +nightly fuzz run is_match_case_sensitive
$ cargo +nightly fuzz run is_match_case_insensitive

With full (expected) coverage in wildmatch (except for the impossible condition(s), like nested alternates).

@Dav1dde Dav1dde force-pushed the dav1d/wildmatch branch 8 times, most recently from 96ca156 to 177e7a0 Compare September 26, 2024 09:36
@Dav1dde Dav1dde self-assigned this Sep 26, 2024
@Dav1dde Dav1dde marked this pull request as ready for review September 26, 2024 09:37
@Dav1dde Dav1dde requested a review from a team as a code owner September 26, 2024 09:37
Copy link
Contributor

@loewenheim loewenheim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. The very thorough comments definitely help.

relay-pattern/src/lib.rs Outdated Show resolved Hide resolved
relay-pattern/src/wildmatch.rs Outdated Show resolved Hide resolved
relay-pattern/src/wildmatch.rs Outdated Show resolved Hide resolved
relay-pattern/src/wildmatch.rs Outdated Show resolved Hide resolved
relay-pattern/src/wildmatch.rs Outdated Show resolved Hide resolved
struct CaseInsensitive;

impl Matcher for CaseInsensitive {
#[inline(always)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, does #[inline(always)] actually bring a performance improvement in the places where you added it?

https://std-dev-guide.rust-lang.org/policy/inline.html#what-about-inlinealways

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't and usually I would just rely on the compiler (which is still allowed to ignore a inline(always)). I added it here to hopefully force a more aggressive monomorphization which hopefully leads to a better codegen. It's specifically also added on internal helpers.
A lot of 'hopefully's in there, but it should (another Konjunktiv) only make it better not worse.

relay-pattern/src/wildmatch.rs Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
relay-pattern/src/lib.rs Outdated Show resolved Hide resolved
relay-pattern/src/wildmatch.rs Outdated Show resolved Hide resolved
relay-pattern/src/wildmatch.rs Outdated Show resolved Hide resolved
// TODO: implement manual lowercase which remembers if there were 'special' unicode
// conversion involved, if not, there is no recovery necessary.
// TODO: benchmark if a lut from offset -> original offset makes sense.
// TODO: benchmark allocation free and search with proper case insensitive search.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we could implement a special is_prefix_ignore_case ourselves?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I essentially conceded to this implementation for now, because it was the easiest to get working. This is definitely a place where we potentially can get a lot of performance with a better implementation. A better impl here will also benefit is_prefix (it's essentially the same code).

relay-pattern/src/wildmatch.rs Show resolved Hide resolved
relay-pattern/src/wildmatch.rs Outdated Show resolved Hide resolved
relay-pattern/src/wildmatch.rs Outdated Show resolved Hide resolved
@Dav1dde Dav1dde requested a review from jjbayer October 1, 2024 09:25
@Dav1dde Dav1dde merged commit 4fc8ee8 into master Oct 2, 2024
23 checks passed
@Dav1dde Dav1dde deleted the dav1d/wildmatch branch October 2, 2024 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants