Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use exponential search to speed up lexico partition #585

Merged
merged 1 commit into from
Jul 25, 2021

Conversation

jimexist
Copy link
Member

@jimexist jimexist commented Jul 21, 2021

Which issue does this PR close?

Closes #586

Rationale for this change

benchmark:

lexicographical_partition_ranges(u8) 2^10
                        time:   [13.430 us 13.624 us 13.846 us]
                        change: [-20.446% -19.617% -18.718%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  9 (9.00%) low mild
  3 (3.00%) high mild
  6 (6.00%) high severe

lexicographical_partition_ranges(u8) 2^12
                        time:   [21.706 us 22.377 us 23.029 us]
                        change: [-10.809% -8.7265% -6.6501%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

lexicographical_partition_ranges(u8) 2^10 with nulls
                        time:   [12.534 us 12.701 us 12.869 us]
                        change: [-21.677% -20.203% -18.676%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

lexicographical_partition_ranges(u8) 2^12 with nulls
                        time:   [21.408 us 21.631 us 21.883 us]
                        change: [-9.3607% -7.8667% -6.3528%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe

lexicographical_partition_ranges(f64) 2^10
                        time:   [20.639 us 20.846 us 21.084 us]
                        change: [-64.686% -64.138% -63.561%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

lexicographical_partition_ranges(low cardinality) 1024
                        time:   [1.2718 us 1.2830 us 1.2953 us]
                        change: [+7.0358% +8.4813% +9.9578%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

What changes are included in this PR?

adopting an exponential search

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 21, 2021
@jimexist jimexist marked this pull request as draft July 21, 2021 14:12
@jimexist jimexist marked this pull request as ready for review July 21, 2021 14:15
@codecov-commenter
Copy link

Codecov Report

Merging #585 (cafcac6) into master (d8da826) will increase coverage by 0.00%.
The diff coverage is 91.66%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #585   +/-   ##
=======================================
  Coverage   82.46%   82.46%           
=======================================
  Files         167      167           
  Lines       46205    46213    +8     
=======================================
+ Hits        38101    38108    +7     
- Misses       8104     8105    +1     
Impacted Files Coverage Δ
arrow/src/compute/kernels/partition.rs 97.65% <91.66%> (+0.15%) ⬆️
parquet/src/encodings/encoding.rs 94.85% <0.00%> (-0.20%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d8da826...cafcac6. Read the comment docs.

@jimexist jimexist changed the title use exponential search in lexico partition to speed up use exponential search to speed up lexico partition Jul 21, 2021
@jimexist jimexist force-pushed the exponential-search branch from cafcac6 to 91187ed Compare July 24, 2021 02:59
@jimexist jimexist force-pushed the exponential-search branch from 91187ed to d60557c Compare July 24, 2021 03:00
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look like a good idea to me. I read the article and read the code carefully and I think it looks good.

Thanks @jimexist

// note here we have right = min(indices.len(), bound + 1) because indices[bound] might
// actually be considered and must be included.
(bound / 2)
+ indices[(bound / 2)..indices.len().min(bound + 1)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if some of the performance improvement also comes from potentially making a smaller sequence -- aka (bound/2) .. len rather than partition_point .. len.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

exponential sort can be used to speed up lexico partition kernel
3 participants