Skip to content

Latest commit

 

History

History
88 lines (79 loc) · 4.94 KB

README.md

File metadata and controls

88 lines (79 loc) · 4.94 KB

Fast-Radix-Sort

An optimized implementation of the Radix LSD sorting algorithm for integers, written in C.

Algorithm:

Radix Sort is a non-comparative sorting algorithm able to sort a list of N integers in O(kn) time complexity.
The presented program showcases an implementation of said algorithm written in ANSI C and optimized for speed.

The function sorts 32-bit integers (signed or unsigned), one byte at a time (a bit like the American Flag sorting,
but starting with the Least Significant Digit/Byte).

Radix sort is a stable sorting, meaning every two numbers with the same value appear in the same order in the sorted output as they appear in the input.

This implementation follows an out-of-place approach, meaning it uses an helper array in order to sort the original array.

The program supports all main C compilers (GCC, Clang, MCVS, ...) and architectures.

For more information about the algorithm itself, check the Wikipedia Page.

Optimizations:

The sorting function was made without the help of compiler optimizations in mind.
Therefore, it uses some microtimizations like macros and registers to speed some things up.
Be mindful that compilers like GCC don't provide full-fledged optimizations unless told so.
The ideia is to just include the function in your program and get the performance right out of the box.


  • Taking advantage of cache locaity by sorting bytes in different orders when the input array is bif enought to do so. This means applying the "normal" radix algorithm in subarrays.

  • Use of powers of 2 for the expoents and bucket size in order to use shift and bitwise operations (expoent = 8 in order to sort 1 byet per iteration). This works a lot like the American Flag algorithm used to sort strings.

  • Small preliminary check of the initial unsorted array to determine number of bytes to sort. Special useful in randomly shuffled arrays.

  • The indexes of the buckets express the amount of elements of that respective index in the original array. There is also a array of pointers so that each pointer has the adress in the helper array where the given offset should start.

  • As there are only 4 iterations at max (for a 32 bit integer at least), instead of copying the whole helper array to the original at the end of each iteration, the algorithm switches the purpose of these two arrays in order to reduce copying overhead (eventually correcting this at the end if it stops in a even number of bytes).

  • Neglecting the shift operation when sorting the first byte (equals >> 0).

  • Filling the array of buckets with a blocksize of 4 elements at a time.

  • The algorithms is the same even if the original array contains integers of different signs. The only thing we have to do at the end is starting from the negative numbers instead (in case of different signs). This task has a relatively small overhead as it is done with the help of memcpy and memmove.

    This only applies if we're not sorting all the bytes (see last point)!


  • If, when sorting a certain byte, one of the buckets has all the elements, it means sorting that byte won't do anything at all in the final result, so skip ahead this byte.

  • If we have to sort all bytes either way, we can deal with negative numbers by splitting the for loop that assign the pointers to the helper array in half. This works because the last bit is the sign bit, thus in the last byte the negative numbers will correspond to the buckets with an index higher that 127. Although a small improvement this allows for sorting negative and positive integers in the same time complexity (if sorting all bytes), bringing down the time for the worst case scenario in a random shuffled array.

Mind that as the algorithm works byte per byte, having 1 bit or 8 bits maximun is the same timewise.

Use and compilation:

Provided you are in a Linux based OS, enter the following commands in your terminal:

Compile with: gcc -Wall radix.c -o radix
Execute with: ./radix

More:

The code looks ugly!

I couldn't agree more...
This is mainly because of the excessive use of microptimizations, macros and other small improvements that damages the code readability.
Keep in mind the main objective is performance.

Further improvement:

This algorithm still has space for improving and if you want to optimize it further I can provide a cleaner version for clarity.
If you are kind enough, let me know what optimizations you were able to do.