Speed up advancing within a block, take 2. #13958
PR apache#13692 tried to speed up advancing by using branchless binary search, but while that yielded a speedup on my machine, it yielded a slowdown on nightly benchmarks. This PR tries a different approach using vectorization. Experimentation suggests that it slows queries down a bit when advancing often goes to the very next doc ID, as in term queries and `OrHighNotXXX` tasks, but it speeds up queries that advance to the next few doc IDs, such as `AndHighHigh`. I think this is a good trade-off, since it slows down some already plenty fast queries in exchange for a speedup on some more expensive queries. Here is a `luceneutil` run on `wikibigall` with `-searchConcurrency 0`:

```
Task                    QPS baseline  StdDev  QPS my_modified_version  StdDev              Pct diff  p-value
OrHighNotHigh                 302.78  (2.4%)                   283.75  (2.9%)  -6.3% ( -11% -  -1%)    0.000
OrHighNotMed                  384.69  (3.0%)                   363.33  (2.8%)  -5.6% ( -10% -   0%)    0.000
MedTerm                       564.86  (2.2%)                   537.04  (3.5%)  -4.9% ( -10% -   0%)    0.000
LowTerm                      1014.02  (2.2%)                   967.37  (3.6%)  -4.6% ( -10% -   1%)    0.000
OrHighNotLow                  446.38  (3.4%)                   427.10  (3.3%)  -4.3% ( -10% -   2%)    0.000
HighTerm                      485.41  (1.9%)                   464.49  (3.2%)  -4.3% (  -9% -   0%)    0.000
OrNotHighHigh                 229.78  (2.4%)                   221.51  (3.1%)  -3.6% (  -8% -   1%)    0.000
OrNotHighMed                  396.63  (2.7%)                   382.41  (3.1%)  -3.6% (  -9% -   2%)    0.000
Prefix3                       145.65  (3.6%)                   142.39  (3.7%)  -2.2% (  -9% -   5%)    0.051
IntNRQ                        158.04  (4.7%)                   154.77  (5.6%)  -2.1% ( -11% -   8%)    0.205
CountTerm                    8320.96  (3.2%)                  8198.56  (4.7%)  -1.5% (  -9% -   6%)    0.246
PKLookup                      273.35  (3.6%)                   269.71  (5.2%)  -1.3% (  -9% -   7%)    0.345
Wildcard                       83.30  (3.4%)                    82.28  (3.1%)  -1.2% (  -7% -   5%)    0.234
HighTermMonthSort            3235.98  (3.1%)                  3198.04  (2.9%)  -1.2% (  -6% -   4%)    0.215
HighTermTitleSort             148.94  (2.5%)                   148.38  (2.6%)  -0.4% (  -5% -   4%)    0.638
CountOrHighMed                104.51  (2.0%)                   104.22  (1.7%)  -0.3% (  -3% -   3%)    0.640
HighTermTitleBDVSort           14.67  (5.3%)                    14.64  (5.9%)  -0.2% ( -10% -  11%)    0.899
AndStopWords                   30.68  (3.0%)                    30.66  (2.7%)  -0.1% (  -5% -   5%)    0.941
CountOrHighHigh                50.17  (2.0%)                    50.19  (1.9%)   0.0% (  -3% -   3%)    0.947
OrHighRare                    273.82  (4.5%)                   273.96  (3.8%)   0.0% (  -7% -   8%)    0.971
TermDTSort                    353.37  (6.4%)                   354.23  (6.7%)   0.2% ( -12% -  14%)    0.907
Fuzzy1                         77.85  (2.6%)                    78.12  (2.0%)   0.3% (  -4% -   4%)    0.633
Fuzzy2                         73.23  (2.5%)                    73.50  (1.9%)   0.4% (  -3% -   4%)    0.594
HighTermDayOfYearSort         836.62  (3.1%)                   841.07  (4.0%)   0.5% (  -6% -   7%)    0.639
And2Terms2StopWords           154.49  (1.8%)                   155.41  (2.1%)   0.6% (  -3% -   4%)    0.340
OrHighLow                     771.90  (2.0%)                   778.20  (2.2%)   0.8% (  -3% -   5%)    0.217
And3Terms                     167.63  (2.3%)                   169.23  (2.2%)   1.0% (  -3% -   5%)    0.176
OrStopWords                    33.99  (4.6%)                    34.39  (4.1%)   1.2% (  -7% -  10%)    0.388
CountAndHighMed               148.01  (2.4%)                   149.91  (1.0%)   1.3% (  -2% -   4%)    0.025
Or2Terms2StopWords            156.93  (2.8%)                   159.21  (3.0%)   1.5% (  -4% -   7%)    0.117
AndHighHigh                    67.06  (1.3%)                    68.07  (1.6%)   1.5% (  -1% -   4%)    0.001
OrMany                         18.67  (2.9%)                    18.96  (2.9%)   1.5% (  -4% -   7%)    0.089
AndHighMed                    185.02  (1.6%)                   189.06  (1.3%)   2.2% (   0% -   5%)    0.000
AndHighLow                    948.34  (2.6%)                   970.47  (2.6%)   2.3% (  -2% -   7%)    0.004
OrHighHigh                     68.42  (1.4%)                    70.08  (1.3%)   2.4% (   0% -   5%)    0.000
Or3Terms                      166.47  (2.7%)                   171.10  (3.1%)   2.8% (  -2% -   8%)    0.003
OrNotHighLow                  964.69  (3.1%)                   994.46  (3.3%)   3.1% (  -3% -   9%)    0.002
OrHighMed                     222.32  (2.1%)                   230.93  (1.5%)   3.9% (   0% -   7%)    0.000
CountAndHighHigh               48.88  (2.4%)                    52.87  (1.3%)   8.2% (   4% -  12%)    0.000
```
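To make the approach concrete, here is a minimal sketch of this kind of vectorized advance using the JDK's incubating Panama Vector API (run with `--add-modules jdk.incubator.vector`). The `findNextGEQ` name, the `long[]` block layout, and the scalar tail are illustrative assumptions, not the PR's actual code:

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class VectorAdvanceSketch {
  private static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;

  /** Returns the index of the first doc ID in docs[from, to) that is >= target, or to. */
  static int findNextGEQ(long[] docs, int from, int to, long target) {
    int i = from;
    // Compare a full vector of doc IDs per iteration instead of one branch per element.
    for (int bound = from + SPECIES.loopBound(to - from); i < bound; i += SPECIES.length()) {
      LongVector block = LongVector.fromArray(SPECIES, docs, i);
      VectorMask<Long> geq = block.compare(VectorOperators.GE, target);
      if (geq.anyTrue()) {
        return i + geq.firstTrue(); // index of the first matching lane
      }
    }
    // Scalar tail for the last few elements that don't fill a vector.
    for (; i < to; i++) {
      if (docs[i] >= target) {
        return i;
      }
    }
    return to;
  }
}
```

Because the doc IDs in a block are sorted, the first set lane in the mask is the answer; when the target is usually within the next few doc IDs (the `AndHighHigh` case), one vector comparison replaces several scalar iterations.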
And I seem to be getting a better speedup by using
I ran this PR on my Mac laptop (M3), where it gives a massive slowdown, I imagine because some of the vector operations I'm using are emulated. I need to find what to check against in order to avoid this, like we did for vectors.
You are using VectorMask; only use this where it is implemented in HW (AVX-512 and ARM SVE).
For these uses of VectorMask you are OK with AVX2 (so just use the existing FAST_INTEGER_VECTORS check): https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1597-L1603. So if you want to add this one without slowdowns, I would check:
Maybe it's a bug that it doesn't work on your Mac either, because elsewhere they have code that looks like it is supposed to be doing this stuff: https://github.com/openjdk/jdk/blob/f1a9a8d25b2e1f9b5dbe8719abb66ec4cd9057dc/src/hotspot/cpu/aarch64/aarch64_vector_ad.m4#L3782
I did more digging: vectorization actually worked on my Mac! So my best guess is that I got a ~20% slowdown because I only have 2 lanes on it. For now I disabled the optimization on machines that have fewer than 4 lanes; I'll try to run benchmarks on more CPUs to confirm it's not only helpful on my desktop CPU (AMD Ryzen 9 3900X).
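For illustration, a guard combining the two conditions discussed above could look like the sketch below. The stand-in detection of the fast-integer-vectors check is hypothetical; the real check rmuir references lives in Lucene's vectorization internals:

```java
import jdk.incubator.vector.LongVector;

final class VectorGateSketch {
  // Hypothetical stand-in for the existing fast-integer-vectors platform check;
  // how Lucene actually detects this is not shown here.
  private static final boolean HAS_FAST_INTEGER_VECTORS =
      Boolean.getBoolean("sketch.hasFastIntegerVectors");

  // Enable the vectorized advance only when integer vector masks are fast in
  // hardware (AVX2 or better on x86) and at least 4 lanes are available: with
  // only 2 lanes (128-bit vectors over 64-bit longs, as on an M3), the vector
  // loop lost to the scalar loop.
  static final boolean USE_VECTORIZED_ADVANCE =
      HAS_FAST_INTEGER_VECTORS && LongVector.SPECIES_PREFERRED.length() >= 4;
}
```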
Here's a `wikimediumall` run on a c7i.2xlarge instance that supports AVX-512:
I plan on merging this change soon, and then looking into moving postings back to int[] arrays, to hopefully get benefits from having 2x more lanes that can be compared at once.
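For context on the 2x claim: at a fixed vector width, an int species has twice as many lanes as a long species. A tiny standalone check (an assumed snippet, not from the PR):

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.LongVector;

// Prints the preferred lane counts; on 256-bit AVX2 this is 8 ints vs 4 longs,
// so int[] postings double the doc IDs compared per vector iteration.
public class LaneCount {
  public static void main(String[] args) {
    System.out.println("int lanes:  " + IntVector.SPECIES_PREFERRED.length());
    System.out.println("long lanes: " + LongVector.SPECIES_PREFERRED.length());
  }
}
```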
Nightly benchmarks just picked up the change with a mix of speedups and slowdowns: https://benchmarks.mikemccandless.com/2024.10.30.18.12.23.html. Here are the main ones I'm seeing:

Speedups:
Slowdowns:
I'm a bit surprised/disappointed at the
If you check out data at #13692 (comment),