Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add IContains for case insensitiveILIKE: improve performance for "contains" where the needle is ASCII #6131

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

samuelcolvin
Copy link
Contributor

@samuelcolvin samuelcolvin commented Jul 26, 2024

DO NOT MERGE

This is superseded by #6145, even if we keep this, we should special case the method defined here to only be used for small haystacks.


Which issue does this PR close?

This continues work from #6118 and #6128 and will require them to be merged before it's fixed, it's the final part of #6107 I think.

Rationale for this change

Lots of context in #6107, this makes ILIKE queries which are simply "contains" significantly faster.

Benchmarks:

➤ cargo bench -p arrow --bench comparison_kernels -F test_utils -- 'ilike.*contains'
   Compiling arrow-string v52.2.0 (/Users/samuel/code/arrow-rs/arrow-string)
   Compiling arrow v52.2.0 (/Users/samuel/code/arrow-rs/arrow)
    Finished `bench` profile [optimized] target(s) in 3.99s
     Running benches/comparison_kernels.rs (target/release/deps/comparison_kernels-95ab196215ed59e6)
ilike_utf8 scalar contains
                        time:   [125.84 µs 125.89 µs 125.95 µs]
                        change: [-80.870% -80.819% -80.764%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

nilike_utf8 scalar contains
                        time:   [125.79 µs 126.06 µs 126.49 µs]
                        change: [-80.842% -80.809% -80.776%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  12 (12.00%) high severe

Note, in theory the current regex approach could be faster for very large haystacks, but from my experiments that case will be quite rare, and in many cases the regex will be far far slower, see samuelcolvin/quick-strings#1:

prefix-length=10    needle-length=10    — icontains=10.600 ns regex=23.766 ns
prefix-length=100   needle-length=10    — icontains=54.671 ns regex=122.80 ns
prefix-length=1_000 needle-length=10    — icontains=494.05 ns regex=820.21 ns
prefix-length=2_000 needle-length=10    — icontains=1.0769 µs regex=1.1335 µs
prefix-length=5_000 needle-length=10    — icontains=2.6704 µs regex=2.0020 µs
prefix-length=10    needle-length=1_000 — icontains=564.89 ns regex=1.5671 µs
prefix-length=10    needle-length=2_000 — icontains=1.1176 µs regex=11.945 ms
prefix-length=10    needle-length=5_000 — icontains=2.7539 µs regex=74.793 ms

I therefore think it's better to stick to a single implementation than have a branch for very large haystacks.

What changes are included in this PR?

New special case for "case insensitive ascii contains".

Are there any user-facing changes?

No AFAIK,

@samuelcolvin samuelcolvin changed the title Improve ILIKE contains performance Improve ILIKE performance for "contains" style queries where the needle is ASCII Jul 26, 2024
@alamb alamb changed the title Improve ILIKE performance for "contains" style queries where the needle is ASCII Add IContains for case insensitiveILIKE: improve performance for "contains" where the needle is ASCII Jul 27, 2024
@samuelcolvin
Copy link
Contributor Author

Running benches with filter ilike.*contains on GCP:

haystack length 2

group                          icontains-performance                  master
-----                          ---------------------                  ------
ilike_utf8 scalar contains     1.00    381.9±0.83µs        ? ?/sec    3.97   1515.2±7.72µs        ? ?/sec
nilike_utf8 scalar contains    1.00    382.2±0.92µs        ? ?/sec    3.96   1514.7±6.07µs        ? ?/sec

haystack length 0..400

group                          icontains-performance                  master
-----                          ---------------------                  ------
ilike_utf8 scalar contains     1.00     22.2±0.08ms        ? ?/sec    1.02     22.7±0.10ms        ? ?/sec
nilike_utf8 scalar contains    1.00     22.2±0.09ms        ? ?/sec    1.02     22.7±0.12ms        ? ?/sec

E.g. this is very good when the haytack length is short (or I guess equivalently when there's a match early in the haystack) and no worse for longer haystacks.

Hopefully long term @BurntSushi implements an ascii case-insensitive finder in memchr as he mentioned in BurntSushi/memchr#156 (comment) and we can replace this with something that's faster in all cases.

@samuelcolvin samuelcolvin marked this pull request as draft July 29, 2024 21:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant