What's the most efficient way to compare two blocks of memory in the D language?

7

4

I need a comparison function for blocks of memory for doing binary searches on arrays of bytes in the D programming language. It does not need to have any useful semantics. It only needs to be fast and be a valid comparison function (one that produces a total ordering). The blocks of memory to be compared are already known to be the same length.

C's memcmp is actually pretty slow because it tries to preserve useful string comparison semantics, which I don't need. Below is the best I've come up with so far. Does anyone know of anything better, preferably w/o dipping into non-portable CPU-specific instructions?

// Faster than C's memcmp because it doesn't preserve any meaningful
// semantics.  It's just a completely arbitrary, but really fast,
// comparison function.
int memoryCompare(const(void)* lhs, const(void)* rhs, size_t n) {
    for(; n >= uint.sizeof; n -= uint.sizeof) {
        if( *(cast(uint*) lhs) < *(cast(uint*) rhs)) {
            return -1;
        } else if( *(cast(uint*) lhs) > *(cast(uint*) rhs)) {
            return 1;
        }
        lhs += uint.sizeof;
        rhs += uint.sizeof;
    }

    for(; n >= ubyte.sizeof; n -= ubyte.sizeof) {
        if( *(cast(ubyte*) lhs) < *(cast(ubyte*) rhs)) {
            return -1;
        } else if( *(cast(ubyte*) lhs) > *(cast(ubyte*) rhs)) {
            return 1;
        }
        lhs += ubyte.sizeof;
        rhs += ubyte.sizeof;
    }

    return 0;
}

Edit: I've read up on SSE and I don't want to use it for 3 reasons:

  1. It's not portable.
  2. It requires programming in ASM.
  3. The comparison instructions assume your data is floating points, which might be problematic if some data happens to match the pattern for a NaN.

dsimcha

Posted 2009-11-06T01:49:13.900

Reputation: 44 782

D seems pretty transliterable to C. The code presented is not portable; it presumes byte order and no alignment requirement or preference, which pretty much yells x86/x86-64. Which is LSB-first for ints, so the (uint) comparison will return the wrong sign value. Hence maybe the "semantics" comment :-) – Mischa – 2012-01-28T20:42:05.670

All my benchmarks (ranging across Xeon, Core2 and divers AMD64) give every variant of REP CMPS a thumbsdown. Pity, because gcc 4.4 -O3 generates an inline REP CMPSB for memcmp by default (even without -finline-all-stringops. It's sort of depressing that they added SSE4.2 PCMPESTRI instead of devoting a bit of chip space to improving REP CMPS/MOVS/SCAS. I've tried various old-school asm to make a better memcmp; nothing that works well in all cases. SSE2 is actually the best, and if you are working x86 anyway, SSE2 is on every intel-compatible chip since 2005. – Mischa – 2012-01-28T20:46:41.877

Also, you don't need to use asm to program SSE. Check out <xmmintrin.h> – Mischa – 2012-01-29T01:53:02.747

For examples, dealing with bytes (and bits) you can check out http://mischasan.wordpress.com/2011/06/22/what-the-is-sse2-good-for-char-search-in-long-strings/ and other articles. I would reproduce source code here except, as you can see, I'm having a bit of a struggle with formatting text here (damnyouenterkey).

– Mischa – 2012-01-29T02:12:47.643

Perhaps: http://faydoc.tripod.com/cpu/scasb.htm (x86)

– None – 2009-11-06T02:02:21.363

If you want to go really low-level, you could try inline assembly with rep cmpsb, rep cmpsd or even SSE comparisons. – schnaader – 2009-11-06T02:07:04.627

Re "C's memcmp is actually pretty slow because it tries to preserve useful string comparison semantics". You're of course discussing a particular implementation, but in general on modern CPUs memcmp is bound by the speed by which you can access memory. No algorithm can do better than that, provably. – MSalters – 2009-11-06T10:41:54.633

4What do you mean, "C's memcmp is actually pretty slow because it tries to preserve useful string comparison semantics"? This sounds more like strncmp() than memcmp(). Please show benchmarks showing that your version (which won't compile because it's not even C) is faster than a modern C standard library's built-in memcmp(). – Chris Lutz – 2009-11-06T22:18:27.160

I retagged it to "d". The code is all d, so i'm not sure why it's tagged "c". Feel free to change it back if it wasn't a typo, of course, but it would probably keep confusing readers and think your code is in error. – Johannes Schaub - litb – 2009-11-06T23:44:23.147

Wow, someone's gone on a downvote-everyone spree. – Crashworks – 2009-11-06T23:58:46.820

op i wonder what you think of my answer/code. – None – 2009-11-07T14:33:02.720

He's using D for the example, but the question is about a C library function and how to improve on it. – Jyaan – 2010-08-11T00:55:18.140

Answers

3

You could try:

  • check if uint is the largest type that fits in your target CPU (ulongs might fit a native register better)
  • use 2 pointers of that type
  • use 2 local variables using *p++ (dont dereference the pointers 2 times for 1 value)
  • devide the counter of the 1st loop up front (use while (counter--))
  • unroll the 2nd loop by replacing it with a switch (if sizeof(type that fits in a register) is known and will be always the same.)

Edit: if the first loop is the bottleneck, unrolling might be the answer. Combined with halving the number of conditions in case of equal values, for unrolling 4 times I get something like:

uint* lp = (uint*)lhs;
uint* rp = (uint*)rhs;
uint  l;
uint  r;
int count = (n / uint.sizeof) / 4;

while (count--) {
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
}

Of course that leaves (n / uint.sizeof) % 4 iterations left to do, which you can mix into this loop by interleaving a switch, I left that as exercise for the reader evil grin.

rsp

Posted 2009-11-06T01:49:13.900

Reputation: 20 041

>

  • I tried using ulongs, it was slower.
  • I tried taking out the second dereference. It wasn't any faster, propbably b/c the compiler optimizes to the same thing anyhow, so I put the 2nd deref back because it makes the code more readable.
  • I think most of the bottleneck is the first loop.
  • < – dsimcha – 2009-11-06T14:32:02.437

    Can the person who voted -1 on all answers add a note as to why? – rsp – 2009-11-07T08:29:36.803

    rsp I assume its just a troll. I dont think any of these answers deserve a downvote. – None – 2009-11-08T18:09:39.103

    2

    I dont know much about it but there are vector instructions which can apply instructions to many bytes at once. You can use those results to do a blazing fast memcmp. I dont know which instructions you would need but if you look up Larrabee New Instructions or check out this article you may find what you are looking for http://www.ddj.com/architect/216402188

    NOTE: This CPU isnt out ATM AFAIK

    -Edit- Right now i am positive there is an instruction sets (try looking at SSE or SSE2) which may compare 16 bytes at once if they are aligned.

    You can try this pure c++ code.

    template<class T>
    int memoryCompare(const T* lhs, const T * rhs, size_t n) {
        const T* endLHS = lhs + n/sizeof(T);
        while(lhs<endLHS) {
            int i = *lhs - *rhs;
            if(i != 0) return i > 0 ? 1 : -1;
            lhs++;
            rhs++;
        }
        //more code for the remaining bytes or call memoryCompare<char>(lhs, rhs, n%sizeof(T));
        return 0;
    }
    

    The advantage here is your increasing the pointer so you can dereference it and not apply an index (its ptr_offset[index] vs ptr_offset). The above uses a template so you can use 64 bits on 64 bit machines. and CMPs in assembly are really just subtracts checking the N and Z flags. Instead of comparing N and decreasing N i just compare in my version.

    user34537

    Posted 2009-11-06T01:49:13.900

    Reputation:

    It might help if you dereference lhs and rhs :-) – rsp – 2009-11-07T17:21:41.883

    oops, heh XD. Fixed. – None – 2009-11-08T18:06:49.297

    1

    I think memcmp is specified to do a byte-by-byte comparison, regardless of data type. Are you sure your compiler's implementation preserves string semantics? It shouldn't.

    Crashworks

    Posted 2009-11-06T01:49:13.900

    Reputation: 32 237

    1

    Well a lot depends upon your system and data. There is only so many assumptions we can make. What processor are you using? Does it have to be straight C code? How wide are the CPU registers? What is the cache structure of the CPU? etc etc

    It also depends on how different your data is. If it is very unlikely that the first byte from each buffer is the same then the speed of the function is pretty meaningless as, logically, it won't get to the rest of the function. If it is likely that the first n-1 bytes are usually the sme then it becomes more important.

    All in you are unlikely to see much change regardless of how you do the test.

    Anyway this is a little implementation of my own, it may, or may not, be faster than your own (or, given i just made it up as i went along, it may, or may not, work ;)):

    int memoryCompare(const void* lhs, const void* rhs, size_t n) 
    {
        uint_64 diff    = 0
    
        // Test the first few bytes until we are 32-bit aligned.
        while( (n & 0x3) != 0 && diff != 0 )
        {
            diff = (uint_8*)lhs - (uint_8*)rhs;
            n--;
            ((uint_8*)lhs)++;
            ((uint_8*)rhs)++;              
            }
    
        // Test the next set of 32-bit integers using comparisons with
        // aligned data.
        while( n > sizeof( uint_32 ) && diff != 0 )
        {
            diff = (uint_32*)lhs - (uint_32*)rhs;
            n   -= sizeof( uint_32 );
            ((uint_32*)lhs)++;
            ((uint_32*)rhs)++;
        }
    
        // now do final bytes.
        while( n > 0 && diff != 0 ) 
            {
            diff = (uint_8*)lhs - (uint_8*)rhs;
            n--;
            ((uint_8*)lhs)++;
            ((uint_8*)rhs)++;  
            }
    
        return (int)*diff / abs( diff ));
    }
    

    Goz

    Posted 2009-11-06T01:49:13.900

    Reputation: 50 107

    1

    Does the answer to this question help you at all?

    If the compiler has support for implementing memcmp() as an intrinsic/builtin function, it seems that you would have a hard time besting it.

    I have to admit to knowing little to nothing about D, so I have no idea whether the D compiler has support for intrinsics.

    sdtom

    Posted 2009-11-06T01:49:13.900

    Reputation: 851

    The OP already determined his alternative is faster then memcmp and asked for further improvements. If his compiler would use an intrinsic this probably would not be the case. – rsp – 2009-11-07T08:33:03.537

    1

    If you trust your compiler's optimisation you might try a few modifications to acidzombie24s suggestion:

    template<class T> int memoryCompare(const T* lhs, const T * rhs, size_t n) {
        const T* endLHS = &lhs[n];
        while(lhs<endLHS) {
            //A good optimiser will keep these values in register
            //and may even be clever enough to just retest the flags
            //before incrementing the pointers iff we loop again.
            //gcc -O3 did the optimisation very well.
            if (*lhs > *rhs) return 1;
            if (*lhs++ < *rhs++) return -1;
        }
        //more code for the remaining bytes or call memoryCompare<char>(lhs, rhs, n%sizeof(T));
        return 0;
    }
    

    Here's the entire gcc -O3 optimised inner loop in x86 assembler code for a C version of the above, only passing char array pointers:

    Loop:
            incl    %eax         ; %eax is lhs
            incl    %edx         ; %edx is rhs
            cmpl    %eax, %ebx   ; %ebx is endLHS
            jbe     ReturnEq
            movb    (%edx), %cl
            cmpb    %cl, (%eax)
            jg      ReturnGT
            jge     Loop
    ReturnLT:
            ...
    

    Rocso

    Posted 2009-11-06T01:49:13.900

    Reputation: 11