Which is faster, Hash lookup or Binary search?

62

28

When given a static set of objects (static in the sense that once loaded it seldom if ever changes) into which repeated concurrent lookups are needed with optimal performance, which is better, a HashMap or an array with a binary search using some custom comparator?

Is the answer a function of object or struct type? Hash and/or Equal function performance? Hash uniqueness? List size? Hashset size/set size?

The size of the set that I'm looking at can be anywhere from 500k to 10m - incase that information is useful.

While I'm looking for a C# answer, I think the true mathematical answer lies not in the language, so I'm not including that tag. However, if there are C# specific things to be aware of, that information is desired.

TheSoftwareJedi

Posted 2008-12-11T16:48:12.290

Reputation: 25 435

Depends on the hash function's level of perfection. – jmucchiello – 2009-11-09T22:53:25.683

1What is "lookup"? Do you only want to test membership (whether a particular element exists or not)? Or do you have key-value pairs, and want to find the value associated with some key? – ShreevatsaR – 2008-12-11T17:12:43.190

Answers

20

Ok, I'll try to be short.

C# short answer:

Test the two different approaches.

.NET gives you the tools to change your approach with a line of code. Otherwise use System.Collections.Generic.Dictionary and be sure to initialize it with a large number as initial capacity or you'll pass the rest of your life inserting items due to the job GC has to do to collect old bucket arrays.

Longer answer:

An hashtable has ALMOST constant lookup times and getting to an item in an hash table in the real world does not just require to compute an hash.

To get to an item, your hashtable will do something like this:

  • Get the hash of the key
  • Get the bucket number for that hash (usually the map function looks like this bucket = hash % bucketsCount)
  • Traverse the items chain (basically it's a list of items that share the same bucket, most hashtables use this method of handling bucket/hash collisions) that starts at that bucket and compare each key with the one of the item you are trying to add/delete/update/check if contained.

Lookup times depend on how "good" (how sparse is the output) and fast is your hash function, the number of buckets you are using and how fast is the keys comparer, it's not always the best solution.

A better and deeper explanation: http://en.wikipedia.org/wiki/Hash_table

Maghis

Posted 2008-12-11T16:48:12.290

Reputation: 928

47

For very small collections the difference is going to be negligible. At the low end of your range (500k items) you will start to see a difference if you're doing lots of lookups. A binary search is going to be O(log n), whereas a hash lookup will be O(1), amortized. That's not the same as truly constant, but you would still have to have a pretty terrible hash function to get worse performance than a binary search.

(When I say "terrible hash", I mean something like:

hashCode()
{
    return 0;
}

Yeah, it's blazing fast itself, but causes your hash map to become a linked list.)

ialiashkevich wrote some C# code using an array and a Dictionary to compare the two methods, but it used Long values for keys. I wanted to test something that would actually execute a hash function during the lookup, so I modified that code. I changed it to use String values, and I refactored the populate and lookup sections into their own methods so it's easier to see in a profiler. I also left in the code that used Long values, just as a point of comparison. Finally, I got rid of the custom binary search function and used the one in the Array class.

Here's that code:

class Program
{
    private const long capacity = 10_000_000;

    private static void Main(string[] args)
    {
        testLongValues();
        Console.WriteLine();
        testStringValues();

        Console.ReadLine();
    }

    private static void testStringValues()
    {
        Dictionary<String, String> dict = new Dictionary<String, String>();
        String[] arr = new String[capacity];
        Stopwatch stopwatch = new Stopwatch();

        Console.WriteLine("" + capacity + " String values...");

        stopwatch.Start();

        populateStringArray(arr);

        stopwatch.Stop();
        Console.WriteLine("Populate String Array:      " + stopwatch.ElapsedMilliseconds);

        stopwatch.Reset();
        stopwatch.Start();

        populateStringDictionary(dict, arr);

        stopwatch.Stop();
        Console.WriteLine("Populate String Dictionary: " + stopwatch.ElapsedMilliseconds);

        stopwatch.Reset();
        stopwatch.Start();

        Array.Sort(arr);

        stopwatch.Stop();
        Console.WriteLine("Sort String Array:          " + stopwatch.ElapsedMilliseconds);

        stopwatch.Reset();
        stopwatch.Start();

        searchStringDictionary(dict, arr);

        stopwatch.Stop();
        Console.WriteLine("Search String Dictionary:   " + stopwatch.ElapsedMilliseconds);

        stopwatch.Reset();
        stopwatch.Start();

        searchStringArray(arr);

        stopwatch.Stop();
        Console.WriteLine("Search String Array:        " + stopwatch.ElapsedMilliseconds);

    }

    /* Populate an array with random values. */
    private static void populateStringArray(String[] arr)
    {
        for (long i = 0; i < capacity; i++)
        {
            arr[i] = generateRandomString(20) + i; // concatenate i to guarantee uniqueness
        }
    }

    /* Populate a dictionary with values from an array. */
    private static void populateStringDictionary(Dictionary<String, String> dict, String[] arr)
    {
        for (long i = 0; i < capacity; i++)
        {
            dict.Add(arr[i], arr[i]);
        }
    }

    /* Search a Dictionary for each value in an array. */
    private static void searchStringDictionary(Dictionary<String, String> dict, String[] arr)
    {
        for (long i = 0; i < capacity; i++)
        {
            String value = dict[arr[i]];
        }
    }

    /* Do a binary search for each value in an array. */
    private static void searchStringArray(String[] arr)
    {
        for (long i = 0; i < capacity; i++)
        {
            int index = Array.BinarySearch(arr, arr[i]);
        }
    }

    private static void testLongValues()
    {
        Dictionary<long, long> dict = new Dictionary<long, long>(Int16.MaxValue);
        long[] arr = new long[capacity];
        Stopwatch stopwatch = new Stopwatch();

        Console.WriteLine("" + capacity + " Long values...");

        stopwatch.Start();

        populateLongDictionary(dict);

        stopwatch.Stop();
        Console.WriteLine("Populate Long Dictionary: " + stopwatch.ElapsedMilliseconds);

        stopwatch.Reset();
        stopwatch.Start();

        populateLongArray(arr);

        stopwatch.Stop();
        Console.WriteLine("Populate Long Array:      " + stopwatch.ElapsedMilliseconds);

        stopwatch.Reset();
        stopwatch.Start();

        searchLongDictionary(dict);

        stopwatch.Stop();
        Console.WriteLine("Search Long Dictionary:   " + stopwatch.ElapsedMilliseconds);

        stopwatch.Reset();
        stopwatch.Start();

        searchLongArray(arr);

        stopwatch.Stop();
        Console.WriteLine("Search Long Array:        " + stopwatch.ElapsedMilliseconds);
    }

    /* Populate an array with long values. */
    private static void populateLongArray(long[] arr)
    {
        for (long i = 0; i < capacity; i++)
        {
            arr[i] = i;
        }
    }

    /* Populate a dictionary with long key/value pairs. */
    private static void populateLongDictionary(Dictionary<long, long> dict)
    {
        for (long i = 0; i < capacity; i++)
        {
            dict.Add(i, i);
        }
    }

    /* Search a Dictionary for each value in a range. */
    private static void searchLongDictionary(Dictionary<long, long> dict)
    {
        for (long i = 0; i < capacity; i++)
        {
            long value = dict[i];
        }
    }

    /* Do a binary search for each value in an array. */
    private static void searchLongArray(long[] arr)
    {
        for (long i = 0; i < capacity; i++)
        {
            int index = Array.BinarySearch(arr, arr[i]);
        }
    }

    /**
     * Generate a random string of a given length.
     * Implementation from https://stackoverflow.com/a/1344258/1288
     */
    private static String generateRandomString(int length)
    {
        var chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
        var stringChars = new char[length];
        var random = new Random();

        for (int i = 0; i < stringChars.Length; i++)
        {
            stringChars[i] = chars[random.Next(chars.Length)];
        }

        return new String(stringChars);
    }
}

Here are the results with several different sizes of collections. (Times are in milliseconds.)

500000 Long values...
Populate Long Dictionary: 26
Populate Long Array: 2
Search Long Dictionary: 9
Search Long Array: 80

500000 String values...
Populate String Array: 1237
Populate String Dictionary: 46
Sort String Array: 1755
Search String Dictionary: 27
Search String Array: 1569

1000000 Long values...
Populate Long Dictionary: 58
Populate Long Array: 5
Search Long Dictionary: 23
Search Long Array: 136

1000000 String values...
Populate String Array: 2070
Populate String Dictionary: 121
Sort String Array: 3579
Search String Dictionary: 58
Search String Array: 3267

3000000 Long values...
Populate Long Dictionary: 207
Populate Long Array: 14
Search Long Dictionary: 75
Search Long Array: 435

3000000 String values...
Populate String Array: 5553
Populate String Dictionary: 449
Sort String Array: 11695
Search String Dictionary: 194
Search String Array: 10594

10000000 Long values...
Populate Long Dictionary: 521
Populate Long Array: 47
Search Long Dictionary: 202
Search Long Array: 1181

10000000 String values...
Populate String Array: 18119
Populate String Dictionary: 1088
Sort String Array: 28174
Search String Dictionary: 747
Search String Array: 26503

And for comparison, here's the profiler output for the last run of the program (10 million records and lookups). I highlighted the relevant functions. They pretty closely agree with the Stopwatch timing metrics above.

Profiler output for 10 million records and lookups

You can see that the Dictionary lookups are much faster than binary search, and (as expected) the difference is more pronounced the larger the collection. So, if you have a reasonable hashing function (fairly quick with few collisions), a hash lookup should beat binary search for collections in this range.

Bill the Lizard

Posted 2008-12-11T16:48:12.290

Reputation: 287 653

1md5 would be totally inappropriate as a hash to look up values in a hash table. It's a cryptographic hash. – Bill the Lizard – 2008-12-11T17:10:26.940

11Not 'totally inappropriate', just slow. And even good non-cryptographic hash functions can indeed be slower than binary search for small-ish sizes. – Nick Johnson – 2008-12-11T20:49:37.513

1Yep, the default string hash is such a terrible hash function. If keys are long, the hash will be much slower than the average compare. – Stephan Eggermont – 2008-12-11T21:36:14.640

2small correction - O(1) on average for random data and good hash function. Not O(1) amortized. – orip – 2008-12-11T21:47:35.477

1

@orip: No, it's amortized. http://en.wikipedia.org/wiki/Amortized

– Bill the Lizard – 2008-12-11T22:02:07.227

@Stephan: The default string hash for what language? – Bill the Lizard – 2008-12-11T22:03:19.470

C# getHashCode() – Stephan Eggermont – 2008-12-11T22:31:02.613

Surely you don't mean that getHashCode() is slower than md5? So, you must mean it produces a lot more collisions than md5. Hmmm... I think I might understand your earlier point now. md5 might not be totally bad for an input set that size. – Bill the Lizard – 2008-12-11T23:36:27.503

2No, getHashCode is slower than compare. A lot slower for long strings. – Stephan Eggermont – 2008-12-11T23:46:11.073

@bill - the worst case sequence of operations for a hash table is when all the keys hash to the same bucket, and operations are O(N). Amortized time refers to the average cost per operation in worst case behavior (your wikipedia link says it too). – orip – 2008-12-12T00:00:24.537

@orip: The worst case for a hash table doesn't mean you assume the worst possible hash function. – Bill the Lizard – 2008-12-12T01:43:45.063

If you really need to protect yourself from weird data - you can have a hash table where a bucket can be a binary search tree, or another hash table (with a different hash function). So your worst time will still be O(log N) – Zar Shardan – 2013-03-03T01:11:16.343

That said I would think that in most current implementations a simple variable size array buckets (like std:vector) would be used because their O(n) search is faster than BST's O(log N) for small N's due to CPU cache effects (Locality of reference, spatial/sequential) – Zar Shardan – 2013-03-03T01:18:25.510

24It's a little shocking this was upvoted so much since this answer is simply wrong - it's quite common for binary search to be faster than a hashtable. log n is a rather small factor, and can easily be outweighed by caching effects, constant scaling factors and whatnot for any size data - after all, that data needs to fit in this universe; and practically speaking no datastructures are likely to contain more than 2^64 items, and probably no more than 2^30 before you start looking at perf a little more specifically. – Eamon Nerbonne – 2013-03-15T09:58:02.350

You get expected O(1) amortized if you pick a new (sufficiently random) hash function any time you have a collision and you keep the tables a constant factor larger than the number of entries. – Captain Segfault – 2009-02-03T00:32:52.090

There is no right or wrong answer, Nothing is GUARANTEED, O(1) for hash lookup in theory, does it work in practice, most of the time it won't unlike whats advertised. Same for binary trees, but you can implement your own methods to achieve O(1) lookup like others mentioned in partitioning into different blocks, the tradeoff would be more space complexity more like n^3. – user1529412 – 2015-02-13T16:15:14.153

A hash table is O(1), when you really analyse the detail, ultimately because the width of the hash is constant. This imposes a constant limit on the size of the table. The hash width is obviously chosen to suit machine limits at the time, but the algorithms for hash functions are "sticky" - right now, there are I bet a few people scratching their heads for why their containers aren't scaling well on 64-bit machines, having long since forgotten that 32-bit custom hash function. Theoretical nit-picking aside, hashing is fast, so long as you don't need to access items in order. – Steve314 – 2011-02-09T20:33:44.517

@orip - whatever your hash function, it's always bad for some data. Unless you know exactly what data you will encounter in the wild, the worst case still exists - only probability prevents it happening all the time, which is the same as the distinction between the worst case n^2 and the average n log n for quicksort. – Steve314 – 2011-02-09T21:25:46.017

36

The answers by Bobby, Bill and Corbin are wrong. O(1) is not slower than O(log n) for a fixed/bounded n:

log(n) is constant, so it depends on the constant time.

And for a slow hash function, ever heard of md5?

The default string hashing algorithm probably touches all characters, and can be easily 100 times slower than the average compare for long string keys. Been there, done that.

You might be able to (partially) use a radix. If you can split up in 256 approximately same size blocks, you're looking at 2k to 40k binary search. That is likely to provide much better performance.

[Edit] Too many people voting down what they do not understand.

String compares for binary searching sorted sets have a very interesting property: they get slower the closer they get to the target. First they will break on the first character, in the end only on the last. Assuming a constant time for them is incorrect.

Stephan Eggermont

Posted 2008-12-11T16:48:12.290

Reputation: 13 774

Valid point too. – Keltia – 2008-12-11T16:54:58.843

Corbin, please take a look at what big O notation means – Stephan Eggermont – 2008-12-11T17:00:55.167

9@Stephan: We all three said O(1) is faster than O(log n). You also need to look at what big O notation means. It compares the relative resource usage of algorithms as the input size is changing. It's meaningless to talk about a fixed n. – Bill the Lizard – 2008-12-11T17:09:28.643

The question was "which is better". I assume that means "which takes less time". No need to quibble about whether n is constant. – Mike Dunlavey – 2008-12-11T17:11:20.293

1Er... @Mike: n being constant matters a lot. O(log n) can be much faster than O(1) if the n is constant and small the constant-time operation in the O(1) takes a long time. But O(log n) is incredibly unlikely to be faster than O(1) if n is not constant. – Claudiu – 2008-12-11T17:24:12.233

Thanks, Claudiu. I just thought the question was pretty simple. N is bounded, so the only remaining question is the what are the fixed costs of lookup. Unless the fixed cost of hash is pretty bad, hash will be faster. – Mike Dunlavey – 2008-12-11T17:27:36.733

@Mike: You're right, if n is constant then only the fixed costs of hashing vs. lookup matter. However, n being constant is a bit different from n being bounded between 500k and 10M. 20:1 is enough room for log n to make a difference. – Bill the Lizard – 2008-12-11T17:35:39.073

I couldn't agree more. – Mike Dunlavey – 2008-12-11T17:55:36.910

I have found, in some cases (not necessarily this one) that if the dataset seldom changes, it is possible to generate and compile ad-hoc code to do at least part of the search, and shrink the lookup cost by a good factor. – Mike Dunlavey – 2008-12-11T17:58:30.230

BTW I'm upvoting this answer because it's not that bad, really :-) In fact the suggestion of using a radix is quite sensible. – Mike Dunlavey – 2008-12-11T19:01:56.123

If I could paraphrase Stephan's point: O(10M) = O(1) – Mike Dunlavey – 2008-12-11T20:20:48.193

+1, stephan is right: N <= 10 million means O(N) = O(1). – orip – 2008-12-11T21:44:15.157

@orip: That's just wrong. A binary search might be faster when n=500k, but be outperformed by a hash when n=10M. If the input size matters, you can't ignore the term. – Bill the Lizard – 2008-12-11T22:07:05.890

@Bill: the question was about a nearly-not changing set. Of course the hash could be faster, but it could also have 20 times more collisions. You have to compare actual implementations. – Stephan Eggermont – 2008-12-11T22:27:24.100

@Stephan: You don't use a hash table under the assumption that you have a lot of collisions. Any reasonable hash code implementation will give better performance than a binary tree for 500k - 10M objects. – Bill the Lizard – 2008-12-11T22:39:01.770

Let's just say I've seen some unreasonable hash implementations then. – Stephan Eggermont – 2008-12-11T22:43:37.377

I can agree with that. I saw one that could be simplified to "return 0;", after many complicated steps of combining and mixing. – Bill the Lizard – 2008-12-11T23:39:47.217

@bill - if you want to talk actual timing, time it and see. If you want to talk computational complexity: N <= (any constant) means O(N) = O(1). It's two different topics, really. – orip – 2008-12-11T23:57:06.167

I'm detecting violent agreement :-) – Mike Dunlavey – 2008-12-12T02:11:35.427

2Actually the point about string compare getting slower as one gets closer to the target is not inherent in binary search, because it is possible to keep track of the common prefix as you narrow down the subset. (Not that anybody does.) – Mike Dunlavey – 2008-12-12T02:17:25.337

@orip: I already explained why that's wrong. We're comparing two algorithms, not just stating the complexity of one. – Bill the Lizard – 2008-12-12T04:10:43.253

@Mike Dunlavey: +1 Interesting point. Any idea what the real complexity of a binary string search would be, on data with a certain distribution of characters (for simplicity, assume 36 equally-probably characters), if the search were written to exploit common prefixes? – supercat – 2010-12-11T17:29:43.290

@supercat: Well, you could have a string compare routine that just said "ignore the first n characters because we know they're equal", and when it returned a non-equal comparison, it could also return the number of equal characters (in a by-ref argument). This is basically what happens when you search in a trie, so it's still just your basic logN search. – Mike Dunlavey – 2010-12-12T00:16:40.990

The real problems I've seen had to do with product identifiers which mostly differ only in the last few positions. – Stephan Eggermont – 2010-12-12T16:51:16.393

@Mike Dunlavey: If there were no saving of string offset between comparisons, the time required to search for a string in a list of N strings of M characters each would be O(MlogN). Clearly the time can't be less than the larger of O(M) and O(NlogN), but such a lower bound would be less than O(MlogN). Any idea whether there are in fact any savings in O() terms? – supercat – 2010-12-13T15:56:49.477

@supercat: You meant "larger of O(M) and O(logN)" right? OK, I can imagine a dictionary in which M was very much larger than logN. Like say, a dictionary of mammal genomes :) So I guess you're right, it's the larger of the two, but it's not O(MlogN). – Mike Dunlavey – 2010-12-13T16:33:11.650

@Mike Dunlavey: You're right about what I meant. So does keeping track of the string position save an O(M) factor, if M increases with N but is not too large relative to it? – supercat – 2010-12-13T23:18:45.743

@supercat: Yeah, that's the way I see it. Another way to think of it is, suppose the dictionary were a trie. It only takes O(logN) comparisons to find the unique entry, but it takes O(M) comparisons to see if that entry is equal to the one you're looking for. (The one you're looking for might differ from the entry you found, in the last character.) Of course, if M is known in advance, then O(M) = O(1), just to confuse things :) – Mike Dunlavey – 2010-12-14T02:02:45.303

"The default string hashing algorithm probably touches all characters, and can be easily 100 times slower than the average compare for long string keys. Been there, done that." – Firegarden – 2016-08-29T00:29:12.057

@StephanEggermont thank you for this answer. The number of iterations is only one consideration in performance, for smaller n the lookup time for a binary search could very likely outperform the hash map. – Justin Meiners – 2017-03-26T18:20:54.827

17

The only reasonable answer to this question is: It depends. It depends on the size of your data, the shape of your data, your hash implementation, your binary search implementation, and where your data lives (even though it's not mentioned in the question). A couple other answers say as much, so I could just delete this. However, it might be nice to share what I've learned from feedback to my original answer.

  1. I wrote, "Hash algorithms are O(1) while binary search is O(log n)." - As noted in the comments, Big O notation estimates complexity, not speed. This is absolutely true. It's worth noting that we usually use complexity to get a sense of an algorithm's time and space requirements. So, while it's foolish to assume complexity is strictly the same as speed, estimating complexity without time or space in the back of your mind is unusual. My recommendation: avoid Big O notation.
  2. I wrote, "So as n approaches infinity..." - This is about the dumbest thing I could have included in an answer. Infinity has nothing to do with your problem. You mention an upper bound of 10 million. Ignore infinity. As the commenters point out, very large numbers will create all sorts of problems with a hash. (Very large numbers don't make binary search a walk in the park either.) My recommendation: don't mention infinity unless you mean infinity.
  3. Also from the comments: beware default string hashes (Are you hashing strings? You don't mention.), database indexes are often b-trees (food for thought). My recommendation: consider all your options. Consider other data structures and approaches... like an old-fashioned trie (for storing and retrieving strings) or an R-tree (for spatial data) or a MA-FSA (Minimal Acyclic Finite State Automaton - small storage footprint).

Given the comments, you might assume that people who use hash tables are deranged. Are hash tables reckless and dangerous? Are these people insane?

Turns out they're not. Just as binary trees are good at certain things (in-order data traversal, storage efficiency), hash tables have their moment to shine as well. In particular, they can be very good at reducing the number of reads required to fetch your data. A hash algorithm can generate a location and jump straight to it in memory or on disk while binary search reads data during each comparison to decide what to read next. Each read has the potential for a cache miss which is an order of magnitude (or more) slower than a CPU instruction.

That's not to say hash tables are better than binary search. They're not. It's also not to suggest that all hash and binary search implementations are the same. They're not. If I have a point, it's this: both approaches exist for a reason. It's up to you to decide which is best for your needs.

Original answer:


Hash algorithms are O(1) while binary search is O(log n). So as n approaches infinity, hash performance improves relative to binary search. Your mileage will vary depending on n, your hash implementation, and your binary search implementation.

Interesting discussion on O(1). Paraphrased:

O(1) doesn't mean instantaneous. It means that the performance doesn't change as the size of n grows. You can design a hashing algorithm that's so slow no one would ever use it and it would still be O(1). I'm fairly sure .NET/C# doesn't suffer from cost-prohibitive hashing, however ;)

Corbin March

Posted 2008-12-11T16:48:12.290

Reputation: 23 084

1Don't know why this was downvoted - good answer and an interesting point. +1. – xan – 2008-12-11T17:09:33.150

10-1: Big O notation measures complexity, not speed relative to other algorithms. The claim that hashes are O(1) and therefore faster than O(log n) binary searches is not strictly correct. – Juliet – 2008-12-11T18:18:46.460

3And not even practically correct. Default string hashes touch the whole string and can be a lot slower than compares. – Stephan Eggermont – 2008-12-11T21:15:16.557

2@Stephan: Agreed! Good alternatives are string length + hash of first 8 characters or length + hash of first 4 + last 4. Anything but using the whole thing. – Zan Lynx – 2010-08-12T18:44:07.950

-1: So as n approaches infinity, hash performance improves relative to binary search. Completely false. Actually it's the opposite because of collisions. Ever wondered why most DBs don't implement hash indexes? Because the B-tree index is simply faster for big enough data. – freakish – 2014-10-13T13:12:37.360

1@Corbin - but the width of the hash imposes a constant limit on the size of the table anyway, which doesn't exist for binary search. Forget to replace your old 32-bit hash function and maybe your hash table will simply stop working before that O(1) vs. O(log n) becomes relevant. If you factor in the need for wider hashes as tables get larger, you basically end up back at O(log n) where n is the maximum number of keys in the table (rather than the number of items actually present, as with a binary tree). Of course this is a criticism of the theory - hashing usually is faster in practice. – Steve314 – 2011-02-09T20:41:42.483

@Stephan - that's a claimed advantage of ternary trees - basically a binary tree for the first character, then another binary tree for the second character, and so on. Most string searches fail early, and don't need to touch most of the string, but a hashtable scans the whole string irrespective. Since string compares also tend to find a mismatch early, the same applies to normal binary trees - but of course only as long as the most-strings-aren't-found condition holds, which in reality isn't as often as the n-combinations-of-m-characters calculation suggests. – Steve314 – 2011-02-09T20:46:30.933

7

If your set of objects is truly static and unchanging, you can use a perfect hash to get O(1) performance guaranteed. I've seen gperf mentioned a few times, though I've never had occasion to use it myself.

Mark Ransom

Posted 2008-12-11T16:48:12.290

Reputation: 220 724

1If you can place a constant upper bound on the size of any algorithm or data structure, you can claim an O(1) bound for its performance. This is often done in reality - e.g. the performance for searching within a node of a B-tree is considered constant, since (irrespective of linear search or binary search) the maximum size of a node is constant. +1 for a good suggestion, but for the O(1) claim, I think you're cheating a bit. – Steve314 – 2011-02-09T21:21:19.687

@Steve314, I think you miss the point of a perfect hash. By customizing the hash function you are guaranteed to have no collisions, so it truly is one operation to reach the data once you have its hash, plus one comparison to make sure you weren't searching for something not in the table. – Mark Ransom – 2011-02-09T22:12:08.253

but my point is that you customise the hash for a particular and constant amount of data. You are quite right about the advantages of a perfect hash, but since it cannot cope with varying n (or even with varying the data within the n, for that matter) it's still cheating. – Steve314 – 2011-02-09T22:45:30.817

6

Hashes are typically faster, although binary searches have better worst-case characteristics. A hash access is typically a calculation to get a hash value to determine which "bucket" a record will be in, and so the performance will generally depend on how evenly the records are distributed, and the method used to search the bucket. A bad hash function (leaving a few buckets with a whole lot of records) with a linear search through the buckets will result in a slow search. (On the third hand, if you're reading a disk rather than memory, the hash buckets are likely to be contiguous while the binary tree pretty much guarantees non-local access.)

If you want generally fast, use the hash. If you really want guaranteed bounded performance, you might go with the binary tree.

David Thornley

Posted 2008-12-11T16:48:12.290

Reputation: 49 917

Again - downvoted :S. Good points, +1. – xan – 2008-12-11T17:10:32.180

trees also have degenerated cases that effectively turn into a list. most variations have strict invariants to avoid these, of course. – Javier – 2008-12-11T17:50:04.880

1Misleading answer. The performance problem often breaking hashing in practice is the hash function, not the collisions. – Stephan Eggermont – 2008-12-11T23:41:58.210

@Javier - practical binary trees (AVL, red-black etc) don't have those degenerate cases. That said, neither do some hash tables, since the collision-handling strategy is a choice. IIRC, the developer of D used an (unbalanced) binary tree scheme for handling hashtable collisions for Dscript, and got significantly improved average-case performance by doing so. – Steve314 – 2011-02-09T20:54:11.160

5

Surprised nobody mentioned Cuckoo hashing, which provides guaranteed O(1) and, unlike perfect hashing, is capable of using all of the memory it allocates, where as perfect hashing can end up with guaranteed O(1) but wasting the greater portion of its allocation. The caveat? Insertion time can be very slow, especially as the number of elements increases, since all of the optimization is performed during the insertion phase.

I believe some version of this is used in router hardware for ip lookups.

See link text

ApplePieIsGood

Posted 2008-12-11T16:48:12.290

Reputation: 1 917

1Perfect hashing can use all memory it allocates. It often doesn't because of the work involved in finding such a perfect perfect hash function, but for small datasets, it's perfectly doable. – Steve314 – 2011-02-09T22:58:14.283

4

Dictionary/Hashtable is using more memory and takes more time to populate comparing to array. But search is done faster by Dictionary rather than Binary Search within array.

Here are the numbers for 10 Million of Int64 items to search and populate. Plus a sample code you can run by yourself.

Dictionary Memory: 462,836

Array Memory: 88,376

Populate Dictionary: 402

Populate Array: 23

Search Dictionary: 176

Search Array: 680

using System;
using System.Collections.Generic;
using System.Diagnostics;

namespace BinaryVsDictionary
{
    internal class Program
    {
        private const long Capacity = 10000000;

        private static readonly Dictionary<long, long> Dict = new Dictionary<long, long>(Int16.MaxValue);
        private static readonly long[] Arr = new long[Capacity];

        private static void Main(string[] args)
        {
            Stopwatch stopwatch = new Stopwatch();

            stopwatch.Start();

            for (long i = 0; i < Capacity; i++)
            {
                Dict.Add(i, i);
            }

            stopwatch.Stop();

            Console.WriteLine("Populate Dictionary: " + stopwatch.ElapsedMilliseconds);

            stopwatch.Reset();

            stopwatch.Start();

            for (long i = 0; i < Capacity; i++)
            {
                Arr[i] = i;
            }

            stopwatch.Stop();

            Console.WriteLine("Populate Array:      " + stopwatch.ElapsedMilliseconds);

            stopwatch.Reset();

            stopwatch.Start();

            for (long i = 0; i < Capacity; i++)
            {
                long value = Dict[i];
//                Console.WriteLine(value + " : " + RandomNumbers[i]);
            }

            stopwatch.Stop();

            Console.WriteLine("Search Dictionary:   " + stopwatch.ElapsedMilliseconds);

            stopwatch.Reset();

            stopwatch.Start();

            for (long i = 0; i < Capacity; i++)
            {
                long value = BinarySearch(Arr, 0, Capacity, i);
//                Console.WriteLine(value + " : " + RandomNumbers[i]);
            }

            stopwatch.Stop();

            Console.WriteLine("Search Array:        " + stopwatch.ElapsedMilliseconds);

            Console.ReadLine();
        }

        private static long BinarySearch(long[] arr, long low, long hi, long value)
        {
            while (low <= hi)
            {
                long median = low + ((hi - low) >> 1);

                if (arr[median] == value)
                {
                    return median;
                }

                if (arr[median] < value)
                {
                    low = median + 1;
                }
                else
                {
                    hi = median - 1;
                }
            }

            return ~low;
        }
    }
}

ialiashkevich

Posted 2008-12-11T16:48:12.290

Reputation: 509

3

I strongly suspect that in a problem set of size ~1M, hashing would be faster.

Just for the numbers:

a binary search would require ~ 20 compares (2^20 == 1M)

a hash lookup would require 1 hash calculation on the search key, and possibly a handful of compares afterwards to resolve possible collisions

Edit: the numbers:

    for (int i = 0; i < 1000 * 1000; i++) {
        c.GetHashCode();
    }
    for (int i = 0; i < 1000 * 1000; i++) {
        for (int j = 0; j < 20; j++)
            c.CompareTo(d);
    }

times: c = "abcde", d = "rwerij" hashcode: 0.0012 seconds. Compare: 2.4 seconds.

disclaimer: Actually benchmarking a hash lookup versus a binary lookup might be better than this not-entirely-relevant test. I'm not even sure if GetHashCode gets memoized under-the-hood

Jimmy

Posted 2008-12-11T16:48:12.290

Reputation: 64 764

2With a decent optimizer the results should be 0 for both. – Stephan Eggermont – 2008-12-11T21:23:32.827

2

I'd say it depends mainly on the performance of the hash and compare methods. For example, when using string keys that are very long but random, a compare will always yield a very quick result, but a default hash function will process the entire string.

But in most cases the hash map should be faster.

Michael Borgwardt

Posted 2008-12-11T16:48:12.290

Reputation: 291 858

1there's no reason the hash function has to use the whole string. – Javier – 2008-12-11T17:48:16.880

Just a very practical one, you don't want all extensions of a string to end up in the same bucket (unless you use it as a kind of radix, and remove the prefix from the bucket elements, converting it into a trie-like structure) – Stephan Eggermont – 2008-12-11T23:27:42.967

2

I wonder why no one mentioned perfect hashing.

It's only relevant if your dataset is fixed for a long time, but what it does it analyze the data and construct a perfect hash function that ensures no collisions.

Pretty neat, if your data set is constant and the time to calculate the function is small compared to the application run time.

orip

Posted 2008-12-11T16:48:12.290

Reputation: 46 167

1

It depends on how you handle duplicates for hash tables (if at all). If you do want to allow hash key duplicates (no hash function is perfect), It remains O(1) for primary key lookup but search behind for the "right" value may be costly. Answer is then, theorically most of the time, hashes are faster. YMMV depending on which data you put there...

Keltia

Posted 2008-12-11T16:48:12.290

Reputation: 12 579

1“no hash function is perfect” – no, that's wrong. There's such a thing as perfect hashing, with a very wide area of application. The simplest case is of course a degenerate hash function h(x) = x. Notice that this is a valid hash function and there are quite some cases where this is used. – Konrad Rudolph – 2008-12-11T21:32:03.947

1@Konrad - Perfect hashes are only perfect within a very specific context. In reality, "perfect" is a name, not really a description. There's no such thing as a perfect-for-all-purposes hash. That said, the odds of a real-world problem using some well-known standard hash functions are extremely low, except in the specific case of a malicious adversary exploiting knowledge of which hash function was used. – Steve314 – 2011-02-09T22:52:11.137

1

Here it's described how hashes are built and because the Universe of keys is reasonably big and hash functions are built to be "very injective" so that collisions rarely happen the access time for a hash table is not O(1) actually ... it's something based on some probabilities. But,it is reasonable to say that the access time of a hash is almost always less than the time O(log_2(n))

xxxxxxx

Posted 2008-12-11T16:48:12.290

Reputation: 2 267

0

The answer depends. Lets think that the number of elements 'n' is very large. If you are good at writing a better hash function which lesser collisions, then hashing is the best. Note that The hash function is being executed only once at searching and it directs to the corresponding bucket. So it is not a big overhead if n is high.
Problem in Hashtable: But the problem in hash tables is if the hash function is not good (more collisions happens), then the searching isn't O(1). It tends to O(n) because searching in a bucket is a linear search. Can be worst than a binary tree. problem in binary tree: In binary tree, if the tree isn't balanced, it also tends to O(n). For example if you inserted 1,2,3,4,5 to a binary tree that would be more likely a list. So, If you can see a good hashing methodology, use a hashtable If not, you better using a binary tree.

Lahiru

Posted 2008-12-11T16:48:12.290

Reputation: 874

0

Of course, hash is fastest for such a big dataset.

One way to speed it up even more, since the data seldom changes, is to programmatically generate ad-hoc code to do the first layer of search as a giant switch statement (if your compiler can handle it), and then branch off to search the resulting bucket.

Mike Dunlavey

Posted 2008-12-11T16:48:12.290

Reputation: 35 343

Special casing the first layer is definitely a thing to try. – Stephan Eggermont – 2008-12-11T23:50:13.293

I guess I've got a soft spot for code generation, if only because none of the major popular "methodologies" can tell you when it's a win. – Mike Dunlavey – 2008-12-12T02:10:05.900

2I have a code generator that generates nested switch statements for a decision tree. Sometimes it generates gotos (because strictly it's a decision acyclic digraph). But "switch" isn't an algorithm. The compiler might use a hard-coded binary search, or a lookup table (structured in one of several ways - maybe a simple array, possibly a hashtable, maybe a binary-searched array), or whatever. I may be overreaching here - the hard-coded binary search and simple array both definitely exist in real-world compilers, but beyond that - compilers do a good job, and that's enough. – Steve314 – 2011-02-09T23:21:09.990

@Steve314: You're doing it the way I would. "switch" creates a jump table if the cases are suitably contiguous, and that's an algorithm. I've never heard of a compiler generating an if-tree for a switch, but that would be terrific if it did, and that's another algorithm. Anyway, code generation can be a really big win. It depends on the "table" you're searching being relatively static. – Mike Dunlavey – 2011-02-10T00:10:29.730

@Mike - I can't remember now for certain whether it was GCC or VC++ (most likely GCC), but I've seen the if-tree in a disassembly of generated code. As for relatively static, my code generator is doing multiple dispatch, and the set of possible implementations for the polymorphic function is of course completely static at run-time. It's not good for separate compilation, though, as you need to know all the cases to build the decision tree. There are languages that do that with separate compilation, but they build their decision trees/tables at run-time (e.g. on first call). – Steve314 – 2011-02-10T20:27:57.093

@Steve314: I seem to remember a graphics terminal from decades ago that when it wanted to draw a line, with all kinds of options, like width, color, dotted, etc. it would gen some machine language on the stack and run that. I was never that crazy. I did write a DB translator once that had to re-gen about once a week. That really impressed the client, with speed of operation. – Mike Dunlavey – 2011-02-10T20:33:31.640

@Mike - I've not done it, but generating machine code at run-time isn't that rare. I've been tempted to use LLVM and it's JIT compiler on occasion, but always resisted so far. LLVM is actually pretty easy to use, but of course there's a huge amount going on behind the scenes - that's a big overhead for "I could shave a few cycles off each call to that by generating a special-case version in advance". – Steve314 – 2011-02-10T20:43:51.557

@Steve314: Not to belabor, but the time saved was not just execution time, which was powers of 10. It also saved development time, by roughly an order of magnitude. I've got some theories why that was so, but it seems to always be true. – Mike Dunlavey – 2011-02-10T20:51:35.493

@Mike - I can believe it. I'm not sure about the "always true", but I certainly believe that code generation gives you a chance to automate working around the limitations of your language, so you solve the basic problem once instead of perhaps dozens of times even for a single project. Even generating a simple virtual-machine-code can make life easier in some cases (e.g. the "interpreter pattern"). What I meant is that LLVM is a pretty big library, and the cases where I was tempted really don't justify that overhead. – Steve314 – 2011-02-10T21:11:32.330