If you need to sort some data on the CPU, and you want to do it efficiently, you can do a lot better than std::sort.

For the uninitiated, std::sort is the sort function in the C++ standard library. It's a comparison-based sort, usually implemented as quicksort.

Recently I did some work on a parallel radix sort. A radix sort is not a comparison based sort. Instead it uses the actual values of 'digits' (chunks of bits) of the key to efficiently sort the items. As such radix sorts can run faster than comparison based sorts, both in terms of computational complexity and in practical terms.

Some results

The task is to sort N single-precision float values (32-bit floats). The floats have random values.

For the timings below N = 2^20 = 1048576. Best values from ~10 runs are given. Computer: i7-3770 (4 hardware cores), Platform: Windows 8.1 64-bit, compiler: Visual studio 2012, full optimisations + link time code generation.

std::sort:           0.08427573208609829 s  (12.442205769612862 M keys/s)
tbb::parallel_sort:  0.016449525188363623 s (63.745061817453646 M keys/s)
Radix sort:          0.013582046893134248 s (77.20309083383134  M keys/s)
Parallel radix sort: 0.006327968316327315 s (165.70500160288134 M keys/s)
Note that my parallel radix sort is roughly 13 times faster than std::sort!

Breaking down the speedup over std::sort, it looks something like this:

std::sort -> serial radix sort gives a ~6.2 times speedup.

serial radix sort -> parallel radix sort gives a ~2.1 times speedup.

This is way below a linear speedup over the serial radix sort, considering I am using 8 threads in the parallel sort. This likely indicates that memory bandwidth is the limiting factor.

Limitations of radix sort

Radix sort is not always the appropriate sorting choice. It has some memory and space overheads that make it a poor choice for small arrays. Once the number of elements gets smaller than 300 or so I have measured std::sort to be faster.

In addition, to be able to radix sort items, you need to be able to convert each key into a fixed length number of bits, where the ordering on the bit-key is the same as the desired ordering. Therefore you can't sort e.g. arbitrary length strings with radix sort. (Although you could probably use some kind of hybrid sort - radix sort first then finish with a quicksort or something like that)

Conclusion

Parallel radix sort is awesome!

If you are really desperate to sort fast on the CPU and want to use/licence the code, give me an email. Otherwise we'll probably open-source this parallel radix sorting code, once we've got Indigo 4 out the door.

Implementing a parallel radix is also quite fun and educational. I might write an article on how to do it at some point, but writing your own is fun as well!

Further reading

'Radix sort' on Wikipedia

'Radix Sort Revisited' by Pierre Terdiman - has a nice technique for radix sorting floating point data.

'Radix Tricks' by Michael Herf - has another (better?) technique for radix sorting floating point data, and some other tricks.