-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Would you be willing to benchmark agains Julia's default sorting algorithm? #2
Comments
Thanks for reaching out! I'm happy to work with you to get accurate benchmarks and I'll see if I can get the Julia code running a little later. Note that SingeliSort is a newer effort that includes RH sort for 4-byte integers. There's some stuff I'd like to add but all the sorts should be correct (not well tested). See the exports at the bottom of compiled sort.c; I don't see an |
Thanks! That bridges the gap, they now trade off depending on input size on this benchmark with this faster for smaller sizes and Julia faster for larger sizes. Sorting mid-sized arrays is a weakness of Julia's sorting algorithms, and I'm glad to see an algorithm that might be faster in that case.
|
Here's a run on a rather slow CPU; can also confirm that Julia 1.9.0 is >2x faster than 1.8.5!
|
The idea with SingeliSort is to use a top-down quicksort, and switch over to Robin Hood when the array gets small enough and a sample indicates that the distribution is even enough (several other base cases cover situations where this doesn't happen). It's intended to eventually be the sorting algorithm in CBQN implementing BQN, although I don't know when that'll happen. There are some logistical issues with using the repository because there are components that should be shared with the rest of the interpreter instead of duplicating them. I just pushed an updated compiled/sort.c so that's on the latest code (although there's hardly any functional change from the last recompile). Should be possible to run benchmarks from a fresh clone like so:
|
I see the runtime spike for rhsort a little later (somewhere between 1e6 and 1e7) so for me rhsort is faster all the way up to 1e6, but I do get results that are pretty similar to that figure.
I expect that it is possible to significantly outperform Julia on massive arrays as Julia uses an LSB radix sort with bad cache locality while a divide and conquer top down approach (MSB-radix sort) followed by LSB radix sort for the base case is better in that case. These benchmarks make me believe that robin sort is plausibly better than radix sort as a base case—though I'd need to do benchmarks on data that is not drawn from a uniform distribution to know for sure. Do you know why rhsort has that runtime spike at higher sizes? |
Since RH on random data does random writes, the spike comes from running out of L3 cache. The M2's L3 is huge so it can sort bigger arrays. 256-way LSB radix (CBQN's current 4-byte sorting algorithm, incidentally) has very good locality: it reads from one region and writes to 256 at a time, which cache can easily keep up with. Unless the spacing causes it to have set associativity issues as described here. I guess I don't know what happens when it stops fitting in RAM but I've never seen an issue with large sizes in radix sort. The issue with using radix sort as a base case is more that it ignores any work that's already been done. SingeliSort does use a 2-byte radix sort when a partition falls into 2-byte range, implemented here. A nice thing about this is that it only needs to keep 2 bytes per element so it's mostly in place. It's a little faster than RH and it'll end up being used on inputs where the range is small enough relative to the length, which will always be true for large enough arrays. |
You may also be interested in CBQN's current radix sort? For me it's coming up substantially faster than Julia, although surely this is somewhat architecture-dependent. Looking at Building should be quick: clone CBQN; run Here are my timings, showing log_10(n) and average time in ns/v. It uses random numbers in [0,2e9) instead of full-range but that's not going to change anything.
Source is very ugly (blame C): 32-bit sort and radix sum. The Singeli version is laid out a lot better if the unusual language isn't an issue. |
Good point, as long as the hardware supports that many read/write heads efficiently. That explains some performance characteristics I've observed.
Yeah, that's the only time I've come across issues. When sorting something larger than the amount of available physical memory, the OS can support other algorithms with swap memory more effectively than radix sort.
When I originally wrote that code I experimented with counting ahead of time and per-pass and found per-pass to be faster. That could have been due to using 64-bit count variables (I also saw no speedup switching from 64-bit count variables to narrower bit-widths). However, I was pretty new to Julia at that time, so I may have made some oversights in that analysis. Plus, the existence of this faster implementation in CQBN means I must be missing something. These performance characteristics are very strange {n←10⋆𝕩 ⋄ ((3e7÷n) ∧•_timed ∧ n •rand.Range 2e10) × 1e9 ÷ n }¨ 3↓↕8 # presorted 64-bit integers
⟨ 2.6931666666666665 1.6032666666666666 1.5909 1.627 1.9460333333333333 ⟩
{n←10⋆𝕩 ⋄ ((3e7÷n) ∧•_timed n •rand.Range 2e9) × 1e9 ÷ n }¨ 3↓↕8 # random 32-bit integers
⟨ 2.8407999999999998 2.5551666666666666 2.7207000000000003 2.8192666666666666 4.058433333333333 ⟩
{n←10⋆𝕩 ⋄ ((3e7÷n) ∧•_timed ∧ n •rand.Range 2e9) × 1e9 ÷ n }¨ 3↓↕8 # presorted 32-bit integers
⟨ 4.8028 3.5467 3.6523333333333334 3.7621333333333333 4.7351 ⟩
{n←10⋆𝕩 ⋄ ((3e7÷n) ∧•_timed n •rand.Range 2e10) × 1e9 ÷ n }¨ 3↓↕8 # random 64-bit integers
⟨ 50.302800000000005 83.17146666666666 140.1796 139.31573333333333 158.00593333333333 ⟩
{n←10⋆𝕩 ⋄ ((3e7÷n) ∧•_timed ∧ n •rand.Range 2e10) × 1e9 ÷ n }¨ 3↓↕8 # presorted 64-bit integers
⟨ 1.6696666666666666 1.5579 1.5874666666666666 1.6270333333333333 1.7827333333333333 ⟩ however, it seems to me that the bqn implementation for random 32-bit integers is outstanding:
Pardon my delay, it can take me hours to pick up a new language /h |
CBQN doesn't have 64-bit ints: BQN specifies a single numeric type, which is double-precision floats for most implementations. So 32-bit ints and smaller types are an optimization. Like many things It's also a bit embarrassing that we're not using the sortedness flags to cheat on a pre-sorted argument, so expect that to change some day. I don't think a different count type changes anything, except at small sizes, where zeroing the count array and doing sums is the main cost (hopefully Julia's That's an annoying amount of performance we'll be giving up for adaptivity with SingeliSort, in the >1e5 range. No idea what to do about it though, as radix sort is all-or-nothing. Julia being able to provide multiple algorithms is nice for that. |
@jariji shared the existence of rhsort with me here LilithHafner/InterLanguageSortingComparisons#10. I ran the benchmarks in
gcc bench.c -o bench && ./bench
and what I think are equivalent benchmarks in Julia. I found that Julia's default sorting is 2x faster than rhsort in that particular case. If you're willing I'd appreciate your confirmation to refutation of those results. I benchmarked Julia's sorting with this code snippet that can be pasted into the Julia REPL:The text was updated successfully, but these errors were encountered: