-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Collaborating? #1
Comments
Some elegant way to sort random generic data (like 16 values) would be useful, pdqsort does this rather well, though I'm not sure if there's an added cost for other distributions. That's a clever improvement, I'll do some testing and see about incorporating it. |
I was able to adapt pdqsort's method for handling many equal values, which brings the performance on the "generic" case roughly in line with pdqsort. Pushed to https://github.com/mlochbaum/fluxsort . There doesn't seem to be any significant performance impact in other tested cases, but I found that The strategy is the same but reversed left to right to account for the different recursion order. So, before partitioning it checks to see if the pivot is equal to the element after the current slice. The only extra step is to copy that element-after to the swap space before recursing into it. It might seem like the fact that pdqsort can put the pivot in the middle right after partitioning and fluxsort can't would make a difference, but it doesn't because the right partition is sorted (so it starts with the pivot) before visiting the left. How much do you care about code style here? I've copy-pasted the partitioning code to match the existing style, but I'd find a version that uses macros for the loop unrolling and two copies easier to read. |
As far as I can tell the flux_max function isn't stable? Moving the maximum element to the end might be possible to accomplish in n comparisons and ~ log n swaps on randomized data. I'm having a hard time wrapping my head around it. As for code style, I can be a bit particular, but I'm open to suggestions. On my end, I've experimented with a pseudomedian of 25 that detects sorted partitions. It improves sorting of generic data slightly, but it's barely worth the trouble. |
On second thought, I think the bubble sort would be the right way to go. n comps and n swaps. |
No, I guess it isn't. I'd been thinking about ensuring that the maximum element doesn't get swapped with anything less than it, but the element at the end winds up out of place. It's not really necessary because you can track whether the current slice comes at the end of the array or not, it's just that you have to add cases when it's used. But this condition starts false and becomes true after the first base case, so it's not that complicated. Finding the maximum is really fast with SIMD instructions (on appropriate types of course). Anything more complicated is probably too much overhead. |
I reckon there will be a variety of solutions. I'll try to wrap my head around this partitioning logic in the next few days and get back to you. |
Remark on psuedomedians: the order of the candidates does matter, and I think both pdqsort and flux are too systematic. I haven't analyzed the 15-candidate version, but the case with 3-by-3 is that when combining values, two that are on the same side always outweigh a third. If values in the middle are higher and those at the edges are lower, then Fluxsort ends up with two low groups and one high group. It will then pick the median from one of the low groups even though the true median is probably the highest value from a low group. pdqsort spreads out the groups but gets the same problem in the first round of medians, so that every group will end up with a low median. A scheme like 125/048/367 should be better. I also wonder if it might be preferable to just take the true median of 5 or 7 (I don't know how hard that is) instead. Fluxsort has a lot of overhead coming from median calculations and it seems to me like it should be possible to reduce it. |
3x3 is as good at avoiding the worst case partition as 1x7 with fewer comparisons. 1x7 requires 6+5+4+3+2+1 = 21 comparisons worst case while 3x3 is at 3+3+3+3 = 12. I guess a good baseline is to try to beat std:stable_sort for any distribution, this might be possible by adding run detection to the analyzer, since there's a tipping point where branch prediction outperforms branchless. |
I was just wondering if it makes sense to search for an initial run at the beginning of SIMD is also relevant, since a run search can easily check multiple elements at a time. But it doesn't work in plain C because the compiler's not allowed to access elements after the stopping point. |
It's possible to perform a branchless run analysis at a ~3% performance cost for random, useful for detecting pipe organ distributions. |
I'm getting good performance with a branchless bubble sort. I typically use void FUNC(flux_max)(VAR *array, size_t nmemb, CMPFUNC *cmp)
{
VAR swap, *ptb, *pta = array;
while (--nmemb)
{
ptb = &pta[cmp(pta, pta + 1) > 0];
swap = *pta; *pta++ = *ptb; *ptb = swap;
}
} Edit: Previous code was going out of bounds, probably why the readings were odd. Not ideal performance, ~3% slower. |
I took a closer look at the actual partitioning logic, and I think the analyzer would take care of most cases where flux_max would be beneficial? Another concern is that it takes an extra n comparisons, which isn't a big deal, but it might be to some people. As far as I can tell most of the performance gain can be obtained from the One of the things I'm trying to do is to minimize stack allocation during recursion. In my current setup that does result in some unnecessary pivot calls, but it seems to slightly improve performance on random without a notable impact on generic. |
This seems to work to deal with the cnt * cnt % 10000 distribution for large arrays. Perhaps a bit heavy-handed. It's still not quite on par with pdqsort for 10M+ but it gets pretty close. Should be possible to keep track of the previous pivot with a maximum_of_twentyfive routine and passing it along recursively, I'm looking into that next. void FUNC(fluxsort)(void *array, size_t nmemb, CMPFUNC *cmp);
VAR FUNC(median_of_hundredtwentyeight)(VAR *array, size_t nmemb, CMPFUNC *cmp)
{
VAR *pta, swap[128], tmp;
size_t cnt, div = nmemb / 128;
pta = array;
for (cnt = 0 ; cnt < 128 ; cnt++)
{
swap[cnt] = pta[0];
pta += div;
}
FUNC(fluxsort)(swap, 128, cmp);
return swap[64];
} VAR FUNC(median_of_twentyfive)(VAR *array, size_t nmemb, CMPFUNC *cmp)
{
size_t v0, v1, v2, v3, v4, div = nmemb / 64;
v0 = FUNC(median_of_five)(array, div * 4, div * 1, div * 2, div * 8, div * 10, cmp);
v1 = FUNC(median_of_five)(array, div * 16, div * 12, div * 14, div * 18, div * 20, cmp);
v2 = FUNC(median_of_five)(array, div * 32, div * 24, div * 30, div * 34, div * 38, cmp);
v3 = FUNC(median_of_five)(array, div * 48, div * 42, div * 44, div * 50, div * 52, cmp);
v4 = FUNC(median_of_five)(array, div * 60, div * 54, div * 56, div * 62, div * 63, cmp);
return array[FUNC(median_of_five)(array, v2, v0, v1, v3, v4, cmp)];
} if (nmemb > 1024)
{
if (nmemb > 65536)
{
piv = FUNC(median_of_hundredtwentyeight)(ptx, nmemb, cmp);
}
else
{
piv = FUNC(median_of_twentyfive)(ptx, nmemb, cmp);
}
}
else
{
piv = FUNC(median_of_nine)(ptx, nmemb, cmp);
} |
I think that strategy is entirely reasonable, actually, and it doesn't even add much to the code size. It should be able to do the median in already-allocated memory (in extreme cases I still think switching to quadsort on a single bad partition isn't really viable. It should switch when the partition size is too large relative to the depth, for a cumulative measure. Statistics also needed here. |
From this answer, medians follow a beta distribution: the probability that a median of I computed this table of values, where the first column is If you're saving the sorted candidates between iterations then the math works out differently. You could either select more candidates at the start, or merge in candidates to get to the right amount at a lower cost (increasing the size by a factor of |
That lines up pretty closely with my own pivot choices based on benchmarks. I might be able to write a dynamic pivot selection routine. |
I ran some calculations and it seems like the cubic root of the partition size might be a good value for the pseudomedian. I'm pretty happy with some of the changes I made so I'm thinking of pushing out a new version. How would you like me to credit you for your contributions? I've incorporated the loop unrolling and s_size == 0 optimization. |
My branch uses the I also added a pseudorandom value to each pivot candidate. Running a full PRNG step for each pivot adds too much overhead, so I split Marsaglia's Xorshift into three steps and do only one for each candidate. The adjacent candidates won't be independent, but the overall selection will end up with plenty of randomness. I'm thinking about how to apply something similar to the smaller pivot computations as well. As far as I can tell the only reason not to use randomize candidates is performance: the goal is to avoid patterns in pivot selection, and PRNGs are designed specifically for this purpose. I'm not really concerned about the attribution (there's a record of what happened here, anyway). If you've copied any actual code instead of just ideas then you could add me to the license text. |
I've also been looking at insertion sort. Replacing quadsort with insertion sort directly is slightly slower, but removing all the quadsort calls and running unguarded insertion sort on the entire array at the end is the same speed for random data or even slightly faster, and 2-3% faster after increasing |
Increasing FLUX_OUT to 32 increases the comparison count by quite a bit, I'm trying to keep performance balanced for both integers and strings. I updated the source code with the discussed changes. I think I'll create a separate repository that purely focuses on benchmarking pivot selection. |
I'm also very interested in pivot selection. I'd never have guessed that stopping at 9 (or 15 even!) candidates is measurably worse. I just added randomization to the 15-candidate version, based on this generator because I only use a few low bits and the smaller numbers scramble these faster. This randomization makes Too much overhead to do this for 9 candidates. Maybe using a fixed random sequence is best here. The main thing is to avoid doing too badly on exactly periodic data, because that kind of thing will definitely show up. It's also worth looking at a 3-candidate version for the smallest partitions (64? 128?), since it means there's less pressure on the 9-candidate one to be as fast as possible. The 1 in 65,536 chance you give for pseudomedian-of-9 failing is too optimistic. True, it requires 4 candidates to be in the bad range, but there are many combinations of candidates that will cause the failure so the probability is higher. The actual probability that the larger partition fails is Median of 5 groups of 3 gives 1 in 51161, which is probably fine, and slightly better than 3 of 5. |
I'm a bit hesitant about randomization since it might make it difficult to pinpoint bad patterns. I'll run some tests to see for myself if the failure rate is 1 in 1687 for the pseudomedian of 9, I've wondered if I was too optimistic thinking it would be I've switched to the pseudomedian of 25 for partitions above 1024 which seems to do fairly well. I'm doing some testing with larger pseudomedians and what I've dubbed quasimedians (3x3x3, 5x5x5, etc) as well.
|
Nice! I computed tables of failure probability for median combinations. They're shown as negative log base two of the probability, so 20 is roughly one in a million and so on. This gives some indication of the power of the combination. 5-of-5 is 22.89, slightly stronger than 3-of-3-of-3 at 20.86. And 7 is weaker than 3-of-3. The failure probability in the first line can be changed but it looks like these relationships stay stable. |
With brute force testing I come out at 1 in 1333.33 for the pseudomedian of 9. In my own tests I haven't been able to beat pseudomedian of 9 with the median of 3 for small partitions.
9 median returns the center value, 9 sloppy returns the center value but will be off by 1 in 66% of cases. |
I don't get the concern about randomization breaking detection of bad patterns. The only thing that can slow down a branchless quicksort is bad pivot selection, right? But what qualifies as bad depends entirely on what pivot selection algorithm is used. PRNG test suites basically stress-test them against as many possible patterns as possible, so adding randomness in a suitable way should mean the only real-world data that has any chance of breaking them is malicious input. For detecting patterns where fluxsort would be better, the variable-length selection seems ideal since you can do whatever statistics you want on the sample. My method just adds relatively small random offsets to evenly-spaced values, so they're still in order and approximately evenly spaced. I think this is more robust for pattern detection because with exact even spacing you might hit a part of the array that isn't representative (maybe if values come from a table with rows of 5 and
|
I guess I've looked into run detection in I also wrote a dynamic pseudomedian selector. I get the impression though that on large partitions cache pollution outweighs the cpu advantage, so getting the median of 128 or 256 with fluxsort seems to perform better. I'll post the code in case you want to play around with it. VAR FUNC(flux_median)(VAR *array, size_t nmemb, size_t base, CMPFUNC *cmp)
{
VAR swap[base];
size_t div = nmemb / base;
unsigned char x, y, mid = base / 2;
unsigned char val, t[base];
for (x = 0 ; x < base ; x++)
{
t[x] = 0;
swap[x] = array[0]; array += div;
}
for (x = 0 ; x < base - 1 ; x++)
{
for (y = x + 1 ; y < base ; y++)
{
val = cmp(swap + x, swap + y) > 0; t[x] += val; t[y] += !val;
}
if (t[x] == mid) return swap[x];
}
return swap[base - 1];
}
VAR FUNC(flux_pseudomedian)(VAR *array, size_t nmemb, size_t base, CMPFUNC *cmp)
{
VAR swap[base];
size_t div = nmemb / base;
unsigned char cnt;
for (cnt = 0 ; cnt < base ; cnt++)
{
swap[cnt] = FUNC(flux_median)(array, div, base, cmp); array += div;
}
return FUNC(flux_median)(swap, base, base, cmp);
}
VAR FUNC(flux_quasimedian)(VAR *array, size_t nmemb, size_t base, CMPFUNC *cmp)
{
VAR swap[base];
size_t div = nmemb / base;
unsigned char cnt;
for (cnt = 0 ; cnt < base ; cnt++)
{
swap[cnt] = FUNC(flux_pseudomedian)(array, div, base, cmp); array += div;
}
return FUNC(flux_median)(swap, base, base, cmp);
} Example: |
Some small updates.
Still need to look at a few other things, like your suggestion to keep partitioning the leftover of a reverse partition. |
Some more updates.
I think that's about all I do with fluxsort for now. |
First off, sorry again to have jumped to conclusions on Hacker News. As a programming language guy I'm in the market for a stable sort and Fluxsort seems to have the fundamentals down!
I wanted to know if you're interested in incorporating some or all of what I do into this repository. I'd like to help improve Fluxsort's worst cases and pattern recognition with techniques like pdqsort uses, and expand the interface to support other operations. Some possible topics, which I can also split into individual issues:
(pdqsort depends on instability a lot, but I think there are usually ways to avoid this)
As evidence that I have some clue what I'm talking about, here's a small improvement: the following inner loop for
flux_partition
gives about a 3% speedup in my benchmarks on random 1e6-element 4-byte int arrays.The text was updated successfully, but these errors were encountered: