185

85

This is a long text. Please bear with me. Boiled down, the question is: **Is there a workable in-place radix sort algorithm**?

## Preliminary

I've got a huge number of *small fixed-length* strings that only use the letters “A”, “C”, “G” and “T” (yes, you've guessed it: DNA) that I want to sort.

At the moment, I use `std::sort`

which uses introsort in all common implementations of the STL. This works quite well. However, I'm convinced that radix sort fits my problem set perfectly and should work **much** better in practice.

## Details

I've tested this assumption with a very naive implementation and for relatively small inputs (on the order of 10,000) this was true (well, at least more than twice as fast). However, runtime degrades abysmally when the problem size becomes larger (*N* > 5,000,000).

The reason is obvious: radix sort requires copying the whole data (more than once in my naive implementation, actually). This means that I've put ~ 4 GiB into my main memory which obviously kills performance. Even if it didn't, I can't afford to use this much memory since the problem sizes actually become even larger.

## Use Cases

Ideally, this algorithm should work with any string length between 2 and 100, for DNA as well as DNA5 (which allows an additional wildcard character “N”), or even DNA with IUPAC ambiguity codes (resulting in 16 distinct values). However, I realize that all these cases cannot be covered, so I'm happy with any speed improvement I get. The code can decide dynamically which algorithm to dispatch to.

## Research

Unfortunately, the Wikipedia article on radix sort is useless. The section about an in-place variant is complete rubbish. The NIST-DADS section on radix sort is next to nonexistent. There's a promising-sounding paper called Efficient Adaptive In-Place Radix Sorting which describes the algorithm “MSL”. Unfortunately, this paper, too, is disappointing.

In particular, there are the following things.

First, the algorithm contains several mistakes and leaves a lot unexplained. In particular, it doesn’t detail the recursion call (I simply assume that it increments or reduces some pointer to calculate the current shift and mask values). Also, it uses the functions `dest_group`

and `dest_address`

without giving definitions. I fail to see how to implement these efficiently (that is, in O(1); at least `dest_address`

isn’t trivial).

Last but not least, the algorithm achieves in-place-ness by swapping array indices with elements inside the input array. This obviously only works on numerical arrays. I need to use it on strings. Of course, I could just screw strong typing and go ahead assuming that the memory will tolerate my storing an index where it doesn’t belong. But this only works as long as I can squeeze my strings into 32 bits of memory (assuming 32 bit integers). That's only 16 characters (let's ignore for the moment that 16 > log(5,000,000)).

Another paper by one of the authors gives no accurate description at all, but it gives MSL’s runtime as sub-linear which is flat out wrong.

**To recap**: Is there any hope of finding a working reference implementation or at least a good pseudocode/description of a working in-place radix sort that works on DNA strings?

1@PeterMortensen For the future, I appreciate corrections and links to add context. However, I don’t

particularlyappreciate edits that are mere matters of style (“in the order” vs “on the order”, “i.e.” vs “that is”). – Konrad Rudolph – 2013-01-04T16:28:23.790I realize this is an old question but I am interested in what you were trying to do. I'm teaching a data structures course. Do you still have a data set to play with? How big were the strings? And can you describe what you were sorting the strings for? It strikes me that what you were then doing with the sorted data is a crucial missing part of this question. For example, if your purpose was to have a sorted set of sequences that you could then try to splice together, I think potentially building a trie could be a lot faster than sorting the list intact but that's just a guess. – Dov – 2013-09-07T17:24:53.593

@Dov That’s a loong time ago but I was working with Illumina next-gen sequencing reads (32 bases) and sorting to build a constant-time lookup index (q-gram index). A trie would of course have been possible but using a q-gram index here has proved to work well in practice (Eland etc.). If I remember correctly the index would then be queried by the GPU – but like I said, it was a long time ago for a project that I’m no longer working on. – Konrad Rudolph – 2013-09-08T13:26:54.180

60That is one excellently-written question. – JustinT – 2009-01-20T21:34:14.840

1how small are the small fixed length strings? – EvilTeach – 2009-01-20T21:45:44.693

1@EvilTeach: I've added the use cases. – Konrad Rudolph – 2009-01-20T22:12:17.963

What's the output requirement? Do you want to just produce a sorted list? or are these being put into a database for matching? or...? – Jason S – 2009-01-20T22:47:06.630

@Jason: I just need the list. Post-processing differs drastically. One use case is actually the creation of a suffix array like lookup table (using k-mers instead of suffixes). The current construction with quicksort beats all usual linear-time methods. – Konrad Rudolph – 2009-01-20T23:07:01.177

The question might not be ok. In-place could be as bad as copying if a lot of moves have to be made. – Stephan Eggermont – 2009-01-21T23:30:48.630

2@Stephan: this is all fine and well. But in the case of copying/cache misses I just get a delay. In the case of memory I hit a phyical limit. This is simply nonnegotiable. All those fancy techniques to store parts of the data on disk are definitely slower than the current quicksort solution. – Konrad Rudolph – 2009-01-21T23:53:23.797

2(cont') dsimcha's solution, on the other hand, is definitely

fasterthan quicksort for some inputs. The number of moves may be high and cache locality small but in the real world, it's still good. I've also tweaked the solution slightly to reduce the number of swaps that I need to perform. – Konrad Rudolph – 2009-01-21T23:58:50.057Algorithm for the fast almost in-place LSD radix sort – Bulat – 2018-08-11T13:25:01.587