In-Place Radix Sort



This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm?


I've got a huge number of small fixed-length strings that only use the letters “A”, “C”, “G” and “T” (yes, you've guessed it: DNA) that I want to sort.

At the moment, I use std::sort which uses introsort in all common implementations of the STL. This works quite well. However, I'm convinced that radix sort fits my problem set perfectly and should work much better in practice.


I've tested this assumption with a very naive implementation and for relatively small inputs (on the order of 10,000) this was true (well, at least more than twice as fast). However, runtime degrades abysmally when the problem size becomes larger (N > 5,000,000).

The reason is obvious: radix sort requires copying the whole data (more than once in my naive implementation, actually). This means that I've put ~ 4 GiB into my main memory which obviously kills performance. Even if it didn't, I can't afford to use this much memory since the problem sizes actually become even larger.

Use Cases

Ideally, this algorithm should work with any string length between 2 and 100, for DNA as well as DNA5 (which allows an additional wildcard character “N”), or even DNA with IUPAC ambiguity codes (resulting in 16 distinct values). However, I realize that not all of these cases can be covered, so I'm happy with any speed improvement I get. The code can decide dynamically which algorithm to dispatch to.


Unfortunately, the Wikipedia article on radix sort is useless. The section about an in-place variant is complete rubbish. The NIST-DADS section on radix sort is next to nonexistent. There's a promising-sounding paper called Efficient Adaptive In-Place Radix Sorting which describes the algorithm “MSL”. Unfortunately, this paper, too, is disappointing.

In particular, there are the following problems.

First, the algorithm contains several mistakes and leaves a lot unexplained. In particular, it doesn’t detail the recursion call (I simply assume that it increments or reduces some pointer to calculate the current shift and mask values). Also, it uses the functions dest_group and dest_address without giving definitions. I fail to see how to implement these efficiently (that is, in O(1); at least dest_address isn’t trivial).

Last but not least, the algorithm achieves in-place-ness by swapping array indices with elements inside the input array. This obviously only works on numerical arrays. I need to use it on strings. Of course, I could just screw strong typing and go ahead assuming that the memory will tolerate my storing an index where it doesn’t belong. But this only works as long as I can squeeze my strings into 32 bits of memory (assuming 32 bit integers). That's only 16 characters (let's ignore for the moment that 16 > log(5,000,000)).

Another paper by one of the authors gives no accurate description at all, but it gives MSL’s runtime as sub-linear, which is flat-out wrong.

To recap: Is there any hope of finding a working reference implementation or at least a good pseudocode/description of a working in-place radix sort that works on DNA strings?

Konrad Rudolph

Posted 2009-01-20T21:04:06.667

Reputation: 391 648

1@PeterMortensen For the future, I appreciate corrections and links to add context. However, I don’t particularly appreciate edits that are mere matters of style (“in the order” vs “on the order”, “i.e.” vs “that is”). – Konrad Rudolph – 2013-01-04T16:28:23.790

I realize this is an old question but I am interested in what you were trying to do. I'm teaching a data structures course. Do you still have a data set to play with? How big were the strings? And can you describe what you were sorting the strings for? It strikes me that what you were then doing with the sorted data is a crucial missing part of this question. For example, if your purpose was to have a sorted set of sequences that you could then try to splice together, I think potentially building a trie could be a lot faster than sorting the list intact but that's just a guess. – Dov – 2013-09-07T17:24:53.593

@Dov That’s a loong time ago but I was working with Illumina next-gen sequencing reads (32 bases) and sorting to build a constant-time lookup index (q-gram index). A trie would of course have been possible but using a q-gram index here has proved to work well in practice (Eland etc.). If I remember correctly the index would then be queried by the GPU – but like I said, it was a long time ago for a project that I’m no longer working on. – Konrad Rudolph – 2013-09-08T13:26:54.180

60That is one excellently-written question. – JustinT – 2009-01-20T21:34:14.840

1how small are the small fixed length strings? – EvilTeach – 2009-01-20T21:45:44.693

1@EvilTeach: I've added the use cases. – Konrad Rudolph – 2009-01-20T22:12:17.963

What's the output requirement? Do you want to just produce a sorted list? or are these being put into a database for matching? or...? – Jason S – 2009-01-20T22:47:06.630

@Jason: I just need the list. Post-processing differs drastically. One use case is actually the creation of a suffix array like lookup table (using k-mers instead of suffixes). The current construction with quicksort beats all usual linear-time methods. – Konrad Rudolph – 2009-01-20T23:07:01.177

The question might not be ok. In-place could be as bad as copying if a lot of moves have to be made. – Stephan Eggermont – 2009-01-21T23:30:48.630

2@Stephan: this is all fine and well. But in the case of copying/cache misses I just get a delay. In the case of memory I hit a physical limit. This is simply nonnegotiable. All those fancy techniques to store parts of the data on disk are definitely slower than the current quicksort solution. – Konrad Rudolph – 2009-01-21T23:53:23.797

2(cont') dsimcha's solution, on the other hand, is definitely faster than quicksort for some inputs. The number of moves may be high and cache locality small but in the real world, it's still good. I've also tweaked the solution slightly to reduce the number of swaps that I need to perform. – Konrad Rudolph – 2009-01-21T23:58:50.057

Algorithm for the fast almost in-place LSD radix sort – Bulat – 2018-08-11T13:25:01.587



Well, here's a simple implementation of an MSD radix sort for DNA. It's written in D because that's the language that I use most and therefore am least likely to make silly mistakes in, but it could easily be translated to some other language. It's in-place but requires 2 * seq.length passes through the array.

import std.algorithm : swap;

void radixSort(string[] seqs, size_t base = 0) {
    if(seqs.length == 0)
        return;

    // Partition: 'A's go to the front, 'T's to the back.
    size_t TPos = seqs.length, APos = 0;
    size_t i = 0;
    while(i < TPos) {
        if(seqs[i][base] == 'A') {
            swap(seqs[i], seqs[APos++]);
            i++;
        } else if(seqs[i][base] == 'T') {
            swap(seqs[i], seqs[--TPos]);
        } else i++;
    }

    // Partition the middle range: 'C's before 'G's.
    i = APos;
    size_t CPos = APos;
    while(i < TPos) {
        if(seqs[i][base] == 'C') {
            swap(seqs[i], seqs[CPos++]);
        }
        i++;
    }

    // Recurse on each of the four buckets with the next character.
    if(base < seqs[0].length - 1) {
        radixSort(seqs[0..APos], base + 1);
        radixSort(seqs[APos..CPos], base + 1);
        radixSort(seqs[CPos..TPos], base + 1);
        radixSort(seqs[TPos..seqs.length], base + 1);
    }
}

Obviously, this is kind of specific to DNA, as opposed to being general, but it should be fast.


I got curious whether this code actually works, so I tested/debugged it while waiting for my own bioinformatics code to run. The version above now is actually tested and works. For 10 million sequences of 5 bases each, it's about 3x faster than an optimized introsort.

dsimcha


Posted 2009-01-20T21:04:06.667

Reputation: 44 782

Completely unrelated, but what IDE do you use for D? Been trying to find a nice one... – mpen – 2009-11-24T04:59:33.250

CodeBlocks. It sucks (it's basically just an editor plus basic build automation) and I've been looking for something better, but there are no good IDEs for D2 (the bleeding edge version of the language) yet, and D has enough language level features to eliminate boilerplate code that you don't need a good IDE for it as much as you do for some other languages. – dsimcha – 2009-11-24T05:10:57.970

7If you can live with a 2x pass approach, this extends to radix-N: pass 1 = just go through and count how many there are of each of the N digits. Then if you are partitioning the array this tells you where each digit starts at. Pass 2 does swaps to the appropriate position in the array. – Jason S – 2009-01-20T22:59:43.733

(e.g. for N=4, if there are 90000 A, 80000 G, 100 C, 100000 T, then make an array initialized to the cumulative sums = [0, 90000, 170000, 170100] which is used in place of your APos, CPos, etc. as a cursor for where the next element for each digit should be swapped to.) – Jason S – 2009-01-20T23:01:52.573

I'm not sure what the relation between the binary representation and this string representation is going to be, apart from using at least 4 times as much memory as needed – Stephan Eggermont – 2009-01-22T17:44:55.553

How is the speed with longer sequences? You don't have enough different ones with a length of 5 – Stephan Eggermont – 2009-01-22T21:10:34.530

@dsimcha: Shouldn't your code overflow the stack? – Mehrdad – 2014-03-04T01:42:05.597

This code is more akin to a quicksort with a predefined pivot. The proposition from Jason S. seems much simpler to extend to longer prefixes. – qsantos – 2016-09-12T19:05:59.123

4This radix sort looks to be a special case of the American Flag sort - a well known in-place radix sort variant. – Edward KMETT – 2009-07-15T00:29:23.790


I've never seen an in-place radix sort, and from the nature of radix sort I doubt that it is much faster than an out-of-place sort, as long as the temporary array fits into memory.


The sorting does a linear read on the input array, but all writes will be nearly random. From a certain N upwards this boils down to a cache miss per write. This cache miss is what slows down your algorithm. Whether or not it's in place will not change this effect.

I know that this will not answer your question directly, but if sorting is a bottleneck you may want to have a look at near sorting algorithms as a preprocessing step (the wiki-page on the soft-heap may get you started).

That could give a very nice cache locality boost. A text-book out-of-place radix sort will then perform better. The writes will still be nearly random but at least they will cluster around the same chunks of memory and as such increase the cache hit ratio.

I have no idea if it works out in practice though.

Btw: If you're dealing with DNA strings only: You can compress a char into two bits and pack your data quite a lot. This will cut down the memory requirement by a factor of four over a naive representation. Addressing becomes more complex, but the ALU of your CPU has lots of time to spend during all the cache misses anyway.

Nils Pipenbrinck

Posted 2009-01-20T21:04:06.667

Reputation: 64 271

2Two good points; near sorting is a new concept to me, I'll have to read about that. Cache misses is another consideration that haunts my dreams. ;-) I'll have to see about this. – Konrad Rudolph – 2009-01-20T21:45:51.963

It's new for me as well (a couple of months), but once you get the concept you start to see performance improvement opportunities. – Nils Pipenbrinck – 2009-01-20T21:47:28.357

The writes are far from nearly random unless your radix is very large. For example, assuming you sort one character at a time (a radix-4 sort) all writes will be to one of 4 linearly growing buckets. This is both cache and prefetch friendly. Of course, you might want to use a larger radix, and at some point you hit a tradeoff between cache and prefetch friendliness and radix size. You can push the break-even point towards larger radices using software prefetching or a scratch area for your buckets with periodic flushing to the "real" buckets. – BeeOnRope – 2017-09-07T22:37:27.287


Based on dsimcha's code, I've implemented a more generic version that fits well into the framework we use (SeqAn). Actually, porting the code was completely straightforward. Only afterwards did I find that there are actually publications concerning this very topic. The great thing is: they basically say the same as you guys. A paper by Andersson and Nilsson on Implementing Radixsort is definitely worth the read. If you happen to know German, be sure to also read David Weese's diploma thesis where he implements a generic substring index. Most of the thesis is devoted to a detailed analysis of the cost of building the index, considering secondary memory and extremely large files. The results of his work have actually been implemented in SeqAn, only not in those parts where I needed it.

Just for fun, here's the code I've written (I don't think anyone not using SeqAn will have any use for it). Notice that it still doesn't consider radixes greater than 4. I expect that this would have a huge impact on performance but unfortunately I simply don't have the time right now to implement this.

The code performs more than twice as fast as Introsort for short strings. The break-even point is at a length of about 12–13. The type of string (e.g. whether it has 4, 5, or 16 different values) is comparatively unimportant. Sorting > 6,000,000 DNA reads from chromosome 2 of the human genome takes just over 2 seconds on my PC. Just for the record, that's fast! Especially considering that I don't use SIMD or any other hardware acceleration. Furthermore, valgrind shows me that the main bottleneck is operator new in the string assignments. It gets called about 65,000,000 times – ten times for each string! This is a dead giveaway that swap could be optimized for these strings: instead of making copies, it could just swap all characters. I didn't try this but I'm convinced that it would make a hell of a difference. And, just to say it again, in case someone wasn't listening: the radix size has nearly no influence on runtime – which means that I should definitely try to implement the suggestion made by FryGuy, Stephan and EvilTeach.

Ah yes, by the way: cache locality is a noticeable factor: Starting at 1M strings, the runtime no longer increases linearly. However, this could be fixed quite easily: I use insertion sort for small subsets (<= 20 strings) – instead of mergesort as suggested by the random hacker. Apparently, this performs even better than mergesort for such small lists (see the first paper I linked).

namespace seqan {

template <typename It, typename F, typename T>
inline void prescan(It front, It back, F op, T const& id) {
    using namespace std;
    if (front == back) return;
    typename iterator_traits<It>::value_type accu = *front;
    *front++ = id;
    for (; front != back; ++front) {
        swap(*front, accu);
        accu = op(accu, *front);
    }
}

template <typename TIter, typename TSize, unsigned int RADIX>
inline void radix_permute(TIter front, TIter back, TSize (& bounds)[RADIX], TSize base) {
    for (TIter i = front; i != back; ++i)
        ++bounds[static_cast<unsigned int>((*i)[base])];

    TSize fronts[RADIX];

    std::copy(bounds, bounds + RADIX, fronts);
    prescan(fronts, fronts + RADIX, std::plus<TSize>(), 0);
    std::transform(bounds, bounds + RADIX, fronts, bounds, std::plus<TSize>());

    TSize active_base = 0;

    for (TIter i = front; i != back; ) {
        if (active_base == RADIX - 1)
            return;
        while (fronts[active_base] >= bounds[active_base])
            if (++active_base == RADIX - 1)
                return; // Everything is sorted.

        TSize current_base = static_cast<unsigned int>((*i)[base]);
        if (current_base <= active_base)
            ++i;
        else
            std::iter_swap(i, front + fronts[current_base]);

        ++fronts[current_base];
    }
}

template <typename TIter, typename TSize>
inline void insertion_sort(TIter front, TIter back, TSize base) {
    typedef typename Value<TIter>::Type T;
    struct {
        TSize base, len;
        bool operator ()(T const& a, T const& b) {
            for (TSize i = base; i < len; ++i)
                if (a[i] < b[i]) return true;
                else if (a[i] > b[i]) return false;
            return false;
        }
    } cmp = { base, length(*front) }; // No closures yet. :-(

    for (TIter i = front + 1; i != back; ++i) {
        T value = *i;
        TIter j = i;
        for ( ; j != front && cmp(value, *(j - 1)); --j)
            *j = *(j - 1);
        if (j != i)
            *j = value;
    }
}

template <typename TIter, typename TSize, unsigned int RADIX>
inline void radix(TIter top, TIter front, TIter back, TSize base, TSize (& parent_bounds)[RADIX], TSize next) {
    if (back - front > 20) {
        TSize bounds[RADIX] = { 0 };
        radix_permute(front, back, bounds, base);

        // Sort current bucket recursively by suffix.
        if (base < length(*front) - 1)
            radix(front, front, front + bounds[0], base + 1, bounds, static_cast<TSize>(0));
    }
    else if (back - front > 1)
        insertion_sort(front, back, base);

    // Sort next buckets on same level recursively.
    if (next == RADIX - 1) return;
    radix(top, top + parent_bounds[next], top + parent_bounds[next + 1], base, parent_bounds, next + 1);
}

template <typename TIter>
inline void radix_sort(TIter front, TIter back) {
    typedef typename Container<TIter>::Type TStringSet;
    typedef typename Value<TStringSet>::Type TString;
    typedef typename Value<TString>::Type TChar;
    typedef typename Size<TStringSet>::Type TSize;

    TSize const RADIX = ValueSize<TChar>::VALUE;
    TSize bounds[RADIX];

    radix(front, front, back, static_cast<TSize>(0), bounds, RADIX - 1);
}

} // namespace seqan

Konrad Rudolph

Posted 2009-01-20T21:04:06.667

Reputation: 391 648

Doesn't this easily overflow the stack for any array of decently-sized strings? (Don't you need O(m) auxiliary stack space, where m is the size of a string in the array?) – Mehrdad – 2014-03-04T01:44:03.840

@Mehrdad The whole point of this question, and of this answer, is that we are working with strings that have strict size limits. Otherwise the algorithm performs abysmally anyway. The strings we’re talking about here are two orders of magnitude shorter than would be necessary to overflow any stack. – Konrad Rudolph – 2014-03-04T08:06:59.590

Oh I see, sorry I missed that. It only fails if we're doing MSD sort though, right? If we're doing LSD it shouldn't be so abysmal... – Mehrdad – 2014-03-04T08:11:36.917

@Mehrdad Performance will always be bad because radix sort has O(nm) best-case performance where “m” is the length of the strings. Comparison based algorithms can do better. – Konrad Rudolph – 2014-03-04T08:14:17.220

... wait, what? How long do comparison sorts take? – Mehrdad – 2014-03-04T08:15:00.830

@Mehrdad Comparison sorts have best-case runtimes of O(n) (i.e. not depending on the string length at all). More relevantly, they have O(n·(log n + x)) average case runtime (the actual performance analysis is quite involved, however), where “x” is a comparatively very small number (due to the fact that most of the time the strings you’re comparing are completely dissimilar and comparison therefore takes not O(m) but substantially less time). – Konrad Rudolph – 2014-03-04T08:18:19.147

I see, so you're doing an average-case analysis, whereas I'm doing a worst-case analysis. It depends on your strings I guess? I had to do a BWT at one point, and using radix sort was quite a bit faster than any comparison sorting I could get my hands on. On the other hand though, radix sort is guaranteed to read every single char exactly once (or twice, I guess). You can't really provide that kind of guarantee with any comparison based sorting algorithm. – Mehrdad – 2014-03-04T08:20:30.700

@Mehrdad The worst case is usually not as relevant for sorting as the average case. And you can get very strong guarantees about average case performance as well (using Chernoff bounds). For long strings, comparison based sorting will be faster (unless all string prefixes are equal, in which case they could potentially be excluded from comparison). What you observed must be a fluke, or very possibly a bug. – Konrad Rudolph – 2014-03-04T08:36:57.120

It wasn't a bug or a fluke because we verified it numerous times, but the dataset may not have been very average. I didn't know about Chernoff bounds, thanks for the pointer! – Mehrdad – 2014-03-04T08:57:59.447

I just re-ran my program. When I replace one of the radix sorts with std::sort, the run time increases from 3185 ms to 5552 ms. Definitely correct and definitely not a fluke. (I am using _mm_prefetch in the code for radix sort, but even when I remove that, making it standard code, the running time is still 4297 ms, which is noticeably faster than std::sort's.) The code is taking the BWT of 25,575 repetitions of all except the last 4 lines of this text, with all spaces and punctuations removed. (So the file is 10 MB.)

– Mehrdad – 2014-03-04T10:32:22.487

@Mehrdad Are you by any chance accidentally copying the strings in the comparer that is passed to std::sort? Otherwise the first two characters distinguish all (but two of) those lines so a comparison-based sort would be very fast here. Then again, the strings themselves are not very long, so radix sort might well be slightly faster. Remember, we were talking about cases with long strings. – Konrad Rudolph – 2014-03-04T10:53:41.743

Shoot -- I totally forgot, you're right -- I'm sorting integers, not strings. The integers are related to the original string of course (if I remember correctly, they're the lexicographic suffix ranks), but sorting them only requires examining 3 characters before the algorithm recurses, so it's essentially sorting tuples of 3 integers, or the equivalent of 12-byte strings... which are very short, I totally forgot that. Thanks for pointing that out, that explains what's happening. – Mehrdad – 2014-03-04T11:48:17.450

By the way, do you think a hybrid MSD/LSD radix sort would be practical/useful? By which I mean, you'd perform a variable-length MSD radix sort, initially examining 1 char at a time, then 2 chars, then 4, then 8, etc... but it would be such that examining N chars consists of performing LSD radix sort on those N chars. That should help the algorithm stop early on average, while still bounding the worst-case time to O(mn) without causing a stack overflow. – Mehrdad – 2014-03-04T12:06:44.993

@Mehrdad The idea sounds interesting. I don’t have the time now to think about it in detail but it might indeed have the desired characteristics. – Konrad Rudolph – 2014-03-04T12:47:22.620

Yeah I'll probably try it out. Btw, is it possible to make the in-place radix sort stable, like its out-of-place counterpart? – Mehrdad – 2014-03-04T20:55:00.017


Your first link is broken, might I propose this one instead?

– Cameron – 2014-06-16T20:55:04.627

@Cameron Ah, thanks. I’ve fixed it. – Konrad Rudolph – 2014-06-17T09:32:03.400

@KonradRudolph: What are Container, Value, Size, ValueSize, and length? I mean, I can guess, but it's sort of surprising to see there unexplained. – Mooing Duck – 2015-01-08T17:41:21.683

@KonradRudolph: Also, after making minor tweaks to your code to make it compile, it doesn't seem to do anything. The radix_permute looks similar to the function I designed, but I don't understand how your code is supposed to work, even when I step through it. :(

– Mooing Duck – 2015-01-08T17:53:14.580

@MooingDuck Well like I said, the code is an excerpt from my working copy of the SeqAn library, which is where these types are from. Regarding your second comment, unfortunately, since I ended up not using the code after all I didn’t comment/document it thoroughly, and I don’t remember how radix_permute was supposed to work. :-( I suspect a typo somewhere since the bounds that the permute function receives are always all 0, correct? – Konrad Rudolph – 2015-01-09T00:07:57.657

@KonradRudolph: It was indeed passing zeros for bounds, which I found suspicious. – Mooing Duck – 2015-01-09T00:25:48.590


You can certainly drop the memory requirements by encoding the sequence in bits. For length 2, with "ACGT", there are 16 possible sequences, or 4 bits. For length 3, there are 64 possibilities, which can be encoded in 6 bits. So it works out to 2 bits for each letter in the sequence, or about 32 bits for 16 characters, like you said.

If there is a way to reduce the number of valid 'words', further compression may be possible.

So for sequences of length 3, one could create 64 buckets, maybe sized uint32, or uint64. Initialize them to zero. Iterate through your very very large list of 3 char sequences, and encode them as above. Use this as a subscript, and increment that bucket.
Repeat this until all of your sequences have been processed.

Next, regenerate your list.

Iterate through the 64 buckets in order; for the count found in each bucket, generate that many instances of the sequence represented by that bucket.
When all of the buckets have been iterated, you have your sorted array.

A sequence of 4 adds 2 bits, so there would be 256 buckets. A sequence of 5 adds 2 more, so there would be 1024 buckets.

At some point the number of buckets will approach your limits. If you read the sequences from a file, instead of keeping them in memory, more memory would be available for buckets.

I think this would be faster than doing the sort in situ as the buckets are likely to fit within your working set.

Here is a hack that shows the technique

#include <iostream>
#include <iomanip>

#include <string.h>
#include <stdlib.h>
#include <math.h>

using namespace std;

const int width = 3;
const int bucketCount = exp(width * log(4)) + 1;
      int *bucket = NULL;

const char charMap[4] = {'A', 'C', 'G', 'T'};

void setup(void)
{
    bucket = new int[bucketCount];
    memset(bucket, '\0', bucketCount * sizeof(bucket[0]));
}

void teardown(void)
{
    delete[] bucket;
}

void show(int encoded)
{
    int z;
    int y;
    int j;
    for (z = width - 1; z >= 0; z--)
    {
        int n = 1;
        for (y = 0; y < z; y++)
            n *= 4;

        j = encoded % n;
        encoded -= j;
        encoded /= n;
        cout << charMap[encoded];
        encoded = j;
    }

    cout << endl;
}

int main(void)
{
    // Sort this sequence (placeholder data; the original test string was
    // not preserved -- any DNA string with a multiple-of-width length works)
    const char *testSequence = "GATTACAGATTACAGATTACA";

    size_t testSequenceLength = strlen(testSequence);

    setup();

    // load the sequences into the buckets
    size_t z;
    for (z = 0; z < testSequenceLength; z += width)
    {
        int encoding = 0;

        size_t y;
        for (y = 0; y < width; y++)
        {
            encoding *= 4;

            switch (*(testSequence + z + y))
            {
                case 'A' : encoding += 0; break;
                case 'C' : encoding += 1; break;
                case 'G' : encoding += 2; break;
                case 'T' : encoding += 3; break;
                default  : abort();
            }
        }

        bucket[encoding]++;
    }

    /* show the sorted sequences */
    for (z = 0; z < (size_t)bucketCount; z++)
        while (bucket[z] > 0)
        {
            show(z);
            bucket[z]--;
        }

    teardown();

    return 0;
}

EvilTeach

Posted 2009-01-20T21:04:06.667

Reputation: 19 960

Why compare when you can hash eh? – wowest – 2009-05-08T01:32:59.033

1Damn straight. Performance is generally an issue with any DNA processing. – EvilTeach – 2009-05-08T12:49:42.987


If your data set is so big, then I would think that a disk-based buffer approach would be best:

sort(List<string> elements, int prefix)
    if (elements.Count < THRESHOLD)
        return InMemoryRadixSort(elements, prefix)
    else
        return DiskBackedRadixSort(elements, prefix)

DiskBackedRadixSort(elements, prefix)
    DiskBackedBuffer<string>[] buckets
    foreach (element in elements)
        buckets[element.MSB(prefix)].Add(element)

    List<string> ret
    foreach (bucket in buckets)
        ret.Add(sort(bucket, prefix + 1))

    return ret

I would also experiment with grouping into a larger number of buckets, for instance, if your string was:


the first MSB call would return the bucket for GATT (256 total buckets); that way, you make fewer branches of the disk-based buffer. This may or may not improve performance, so experiment with it.


Posted 2009-01-20T21:04:06.667

Reputation: 7 581

We use memory-mapped files for some applications. However, in general we work under the assumption that the machine provides just barely enough RAM to not require explicit disk backing (of course, swapping still takes place). But we are already developing a mechanism for automatic disk-backed arrays. – Konrad Rudolph – 2009-01-20T21:37:40.000


I'm going to go out on a limb and suggest you switch to a heap/heapsort implementation. This suggestion comes with some assumptions:

  1. You control the reading of the data
  2. You can do something meaningful with the sorted data as soon as you 'start' getting it sorted.

The beauty of the heap/heap-sort is that you can build the heap while you read the data, and you can start getting results the moment you have built the heap.

Let's step back. If you are so fortunate that you can read the data asynchronously (that is, you can post some kind of read request and be notified when some data is ready), then you can build a chunk of the heap while you are waiting for the next chunk of data to come in - even from disk. Often, this approach can bury most of the cost of half of your sorting behind the time spent getting the data.

Once you have the data read, the first element is already available. Depending on where you are sending the data, this can be great. If you are sending it to another asynchronous reader, or some parallel 'event' model, or UI, you can send chunks and chunks as you go.

That said - if you have no control over how the data is read, and it is read synchronously, and you have no use for the sorted data until it is entirely written out - ignore all this. :(

See the Wikipedia articles on binary heaps and heapsort.


Posted 2009-01-20T21:04:06.667

Reputation: 2 691

1Good suggestion. However, I've already tried this and in my particular case the overhead of maintaining a heap is larger than just accumulating the data in a vector and sorting once all the data has arrived. – Konrad Rudolph – 2009-01-20T22:21:36.743


Performance-wise you might want to look at more general string-comparison sorting algorithms.

Currently you wind up touching every element of every string, but you can do better!

In particular, a burst sort is a very good fit for this case. As a bonus, since burstsort is based on tries, it works ridiculously well for the small alphabet sizes used in DNA/RNA, since you don't need to build any sort of ternary search node, hash or other trie node compression scheme into the trie implementation. The tries may be useful for your suffix-array-like final goal as well.

A decent general-purpose implementation of burstsort is available on SourceForge, but it is not in-place.

For comparison purposes, the C-burstsort implementation benchmarks 4-5x faster than quicksort and radix sorts for some typical workloads.

Edward KMETT

Posted 2009-01-20T21:04:06.667

Reputation: 26 819

I'll definitely have to look at burst sort – although at the moment I don't see how the trie could be built in-place. In general suffix arrays have all but replaced suffix trees (and thus, tries) in bioinformatics because of superior performance characteristics in practical applications. – Konrad Rudolph – 2009-07-16T09:39:26.330


You'll want to take a look at Large-scale Genome Sequence Processing by Drs. Kasahara and Morishita.

Strings composed of the four nucleotide letters A, C, G, and T can be specially encoded into integers for much faster processing. Radix sort is among many algorithms discussed in the book; you should be able to adapt the accepted answer to this question and see a big performance improvement.


Posted 2009-01-20T21:04:06.667

Reputation: 2 290

The radix sort presented in this book isn’t in-place so it’s not usable for this purpose. As for the string compaction, I am (of course) already doing this. My (more or less) final solution (posted below) doesn’t show this because the library allows me to treat them like normal strings – but the RADIX value used can of course be (and is) adapted to larger values. – Konrad Rudolph – 2010-01-23T19:23:36.010


"Radix sorting with no extra space" is a paper addressing your problem.



Looks promising, though the problem has actually already been solved. Still, this goes into my reference library. – Konrad Rudolph – 2010-08-05T08:58:41.463


You might try using a trie. Sorting the data is simply a matter of iterating through the dataset and inserting each element; the structure is naturally sorted, and you can think of it as similar to a B-tree (except that instead of making comparisons, you always use pointer indirection).

Caching behavior will favor all of the internal nodes, so you probably won't improve upon that; but you can fiddle with the branching factor of your trie as well (ensure that every node fits into a single cache line, allocate trie nodes similar to a heap, as a contiguous array that represents a level-order traversal). Since tries are also digital structures (O(k) insert/find/delete for elements of length k), you should have competitive performance to a radix sort.



The trie has the same problem as my naive implementation: it requires O(n) additional memory which is simply too much. – Konrad Rudolph – 2009-01-21T11:40:05.990


I would burstsort a packed-bit representation of the strings. Burstsort is claimed to have much better locality than radix sorts, keeping the extra space usage down with burst tries in place of classical tries. The original paper has measurements.

Darius Bacon



Radix sort is not cache-conscious and is not the fastest sorting algorithm for large sets. You can look at:

You can also use compression and encode each letter of your DNA into 2 bits before storing into the sort array.



bill: could you explain what advantages this qsort function has over the std::sort function provided by C++? In particular, the latter implements a highly sophisticated introsort in modern libraries and inlines the comparison operation. I don't buy the claim that it performs in O(n) for most cases, since this would require a degree of introspection not available in the general case (at least not without a lot of overhead). – Konrad Rudolph – 2009-06-14T14:30:19.143

I'm not using C++, but in my tests the inline QSORT can be three times faster than the qsort in the standard library. The ti7qsort is the fastest sort for integers (faster than inline QSORT). You can also use it to sort small fixed-size data. You must do the tests with your data. – bill – 2009-06-14T15:12:48.340


First, think about the encoding of your problem. Get rid of the strings and replace them with a binary representation. Use the first byte to indicate length plus encoding. Alternatively, use a fixed-length representation at a four-byte boundary. Then the radix sort becomes much easier. For a radix sort, the most important thing is to have no exception handling at the hot spot of the inner loop.

OK, I thought a bit more about the 4-nary problem. You want a solution like a Judy tree for this. The next solution can handle variable length strings; for fixed length just remove the length bits, that actually makes it easier.

Allocate blocks of 16 pointers. The least significant bit of the pointers can be reused, as your blocks will always be aligned. You might want a special storage allocator for it (breaking up large storage into smaller blocks). There are a number of different kinds of blocks:

  • Encoding with 7 length bits of variable-length strings. As they fill up, you replace them by:
  • Position encodes the next two characters, you have 16 pointers to the next blocks, ending with:
  • Bitmap encoding of the last three characters of a string.

For each kind of block, you need to store different information in the LSBs. As you have variable length strings you need to store end-of-string too, and the last kind of block can only be used for the longest strings. The 7 length bits should be replaced by less as you get deeper into the structure.

This provides you with a reasonably fast and very memory efficient storage of sorted strings. It will behave somewhat like a trie. To get this working, make sure to build enough unit tests. You want coverage of all block transitions. You want to start with only the second kind of block.

For even more performance, you might want to add different block types and a larger block size. If the blocks are always the same size and large enough, you can use even fewer bits for the pointers. With a block size of 16 pointers, you already have a byte free in a 32-bit address space. Take a look at the Judy tree documentation for interesting block types. Basically, you add code and engineering time for a space (and runtime) trade-off.

You probably want to start with a 256 wide direct radix for the first four characters. That provides a decent space/time tradeoff. In this implementation, you get much less memory overhead than with a simple trie; it is approximately three times smaller (I haven't measured). O(n) is no problem if the constant is low enough, as you noticed when comparing with the O(n log n) quicksort.

Are you interested in handling duplicates? With short sequences, there are going to be some. Adapting the blocks to handle counts is tricky, but it can be very space-efficient.

Stephan Eggermont


I don't see how radix sort becomes easier in my case if I use a bit-packed representation. By the way, the framework I use actually provides the possibility of using a bit-packed representation but this is completely transparent for me as a user of the interface. – Konrad Rudolph – 2009-01-20T23:01:03.450

Not when you look at your stopwatch :) – Stephan Eggermont – 2009-01-21T23:13:09.307

I'll definitely have a look at Judy trees. Vanilla tries don't really bring much to the table though because they behave basically like a normal MSD radix sort with less passes over the elements but require extra storage. – Konrad Rudolph – 2009-01-22T00:02:58.923


dsimcha's MSB radix sort looks nice, but Nils gets closer to the heart of the problem with the observation that cache locality is what's killing you at large problem sizes.

I suggest a very simple approach:

  1. Empirically estimate the largest size m for which a radix sort is efficient.
  2. Read blocks of m elements at a time, radix sort them, and write them out (to a memory buffer if you have enough memory, but otherwise to file), until you exhaust your input.
  3. Mergesort the resulting sorted blocks.

Mergesort is the most cache-friendly sorting algorithm I'm aware of: "Read the next item from either array A or B, then write an item to the output buffer." It runs efficiently on tape drives. It does require 2n space to sort n items, but my bet is that the much-improved cache locality you'll see will make that unimportant -- and if you were using a non-in-place radix sort, you needed that extra space anyway.

Finally, note that mergesort can be implemented without recursion, and doing it this way makes the truly linear memory access pattern clear.




It looks like you've solved the problem, but for the record, one version of a workable in-place radix sort is the "American flag sort", described in the paper Engineering Radix Sort. The general idea is to make two passes over each character position: first count how many of each character you have, so you can subdivide the input array into bins; then go through again, swapping each element into its correct bin. Finally, recursively sort each bin on the next character position.



Actually, the solution I use is very closely related to the Flag Sorting algorithm. I don't know if there's any relevant distinction. – Konrad Rudolph – 2009-01-24T22:16:59.713


Never heard of the American flag sort, but apparently that's what I coded. It's currently outperforming std::sort, and I'm certain a multi-digit version could go faster still, but my test suite is having memory problems (not the algorithm, the test suite itself).

– Mooing Duck – 2015-01-09T18:03:13.787

@KonradRudolph: The big distinction between the Flag sort and other radix sorts is the counting pass. You're right that all the radix sorts are very closely related, but I wouldn't consider yours a Flag sort. – Mooing Duck – 2015-01-09T18:05:49.770

@MooingDuck: Just took some inspiration from your sample there - I got stuck in my own independent implementation, and yours helped me get back on track. Thanks! One possible optimization - I haven't gotten far enough here to see if it's worthwhile yet: If the element in the position you are swapping TO happens to already be where it needs to be, you may want to skip that and advance to one that isn't. Detecting this will require extra logic, of course, and possible extra storage as well, but since swaps are expensive relative to compares, it may be worth doing. – 500 - Internal Server Error – 2018-01-24T20:21:07.397