Binary search for a non-uniform distribution

Binary search is highly efficient for uniform distributions: each member of your list has an equal 'hit' probability, which is why you try the center each time.

Is there an efficient algorithm for non-uniform distributions? E.g. for a distribution following 1/x.

ic3

Posted 2013-06-01T12:22:41.700

Reputation: 4 998

Do you mean the distribution of the data, or the distribution of your queries? – krlmlr – 2013-06-01T12:29:02.823

Let's say the distribution of the queries (not sure if it makes a difference). The fact is that you're more likely to find the answer near the start than at the end. – ic3 – 2013-06-01T12:33:25.647

You may be better off keeping the binary search but adding a cache, i.e., keeping a list of recent queries and checking it for the current query each time. – MegaWidget – 2013-06-01T22:05:56.697

You might be looking for Splay Trees. – recursion.ninja – 2013-12-01T19:17:57.970

Can you give a use case where keys in a dictionary have a 1/x distribution? – arunmoezhi – 2015-05-08T19:42:01.017

A Zipfian distribution approximates many real-world situations, classically the frequencies of words. The frequency of a word is inversely proportional to its rank, as well as to its length. The normal assumption would thus be that the dictionary is in rank order rather than dictionary order. However, lexicographic order with length outranking alphabetic order will tend toward a 1/x distribution. This ordering is used for counting arguments in the theory of computation, but there is evidence that human lexical ordering has such properties. – David M W Powers – 2015-11-21T13:41:05.267

Answers

9

There's a deep connection between binary search and binary trees - a binary tree is basically a "precalculated" binary search where the cutting points are decided by the structure of the tree rather than being chosen as the search runs. And as it turns out, dealing with probability "weights" for each key is sometimes done with binary trees.

One reason is that such a tree can be a fairly normal binary search tree, just built in advance with knowledge of the query probabilities.

Niklaus Wirth covered this in his book "Algorithms and Data Structures", in a few variants (one for Pascal, one for Modula 2, one for Oberon), at least one of which is available for download from his web site.

Binary trees aren't always binary search trees, though, and one use of a binary tree is to derive a Huffman compression code.

Either way, the binary tree is constructed by starting with the leaves separate and, at each step, joining the two least likely subtrees into a larger subtree until there's only one subtree left. To efficiently pick the two least likely subtrees at each step, a priority queue data structure is used - perhaps a binary heap.
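
As a rough illustration, here is a minimal Python sketch of that construction, using the standard heapq module as the binary-heap priority queue (the tuple layout and names are illustrative assumptions, not taken from any particular text):

import heapq
from itertools import count

def build_weighted_tree(weighted_keys):
    # weighted_keys: iterable of (weight, key) pairs.
    # Repeatedly join the two least likely subtrees (Huffman-style).
    # A subtree is a nested tuple (weight, key_or_None, left, right).
    ticket = count()  # tie-breaker so the heap never compares tree tuples
    heap = [(w, next(ticket), (w, k, None, None)) for w, k in weighted_keys]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # least likely subtree
        w2, _, t2 = heapq.heappop(heap)   # next least likely subtree
        joined = (w1 + w2, None, t1, t2)  # internal node, no key of its own
        heapq.heappush(heap, (w1 + w2, next(ticket), joined))
    return heap[0][2]

# e.g. roughly 1/rank word frequencies
root = build_weighted_tree([(1.0, 'the'), (0.5, 'of'), (0.33, 'and'), (0.25, 'a')])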

A binary tree that's built once then never modified can have a number of uses, but one that can be efficiently updated is even more useful. There are some weight-balanced binary tree data structures out there, but I'm not familiar with them. Beware - the term "weight balanced" is commonly used where each node always has weight 1, but subtree weights are approximately balanced. Some of these may be adaptable for varied node weights, but I don't know for certain.

Anyway, for a binary search in an array, the problem is that using an arbitrary probability distribution is possible, but inefficient. For example, you could have a running-total-of-weights array. For each iteration of your binary search, you want to determine the half-way-through-the-probability-distribution point, so you determine the value for that and then search the running-total-of-weights array. You get the perfectly weight-balanced next choice for your main binary search, but you had to do a complete binary search into your running-total array to get it.
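
A minimal sketch of that running-total idea in Python, assuming a prefix-sum array of weights and using the bisect module for the inner search (the names are illustrative):

import bisect

def weighted_midpoint(prefix, lo, hi):
    # prefix[i] = total weight of items 0..i-1, so len(prefix) == n + 1.
    # Pick the index that splits the remaining weight in [lo, hi] roughly in half;
    # this inner bisect is the extra binary search mentioned above.
    target = (prefix[lo] + prefix[hi + 1]) / 2.0
    return bisect.bisect_left(prefix, target, lo + 1, hi + 1) - 1

def weighted_binary_search(keys, weights, wanted):
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + w)
    lo, hi = 0, len(keys) - 1
    while lo <= hi:
        mid = weighted_midpoint(prefix, lo, hi)
        if keys[mid] == wanted:
            return mid
        if keys[mid] < wanted:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1   # with very skewed weights this loop can degrade toward linear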

The principle still works, however, if you can determine that weighted mid-point without searching, because the probability distribution is known. The principle is the same - you need the integral of your probability distribution (replacing the running-total array) and, when you need a mid-point, you choose the point that gives an exact centre value for the integral. That's more an algebra issue than a programming issue.

One problem with a weighted binary search like this is that the worst-case performance is worse - usually by constant factors but, if the distribution is skewed enough, you may end up with effectively a linear search. If your assumed distribution is correct, the average-case performance is improved despite the occasional slow search, but if your assumed distribution is wrong you could pay for that when many searches are for items that are meant to be unlikely according to that distribution. In the binary tree form, the "unlikely" nodes are further from the root than they would be in a simply balanced (flat probability distribution assumed) binary tree.

A flat probability distribution assumption works very well even when it's completely wrong - the worst case is good, and the best and average cases must be at least that good by definition. The further you move from a flat distribution, the worse things can be if actual query probabilities turn out to be very different from your assumptions.

Steve314

Posted 2013-06-01T12:22:41.700

Reputation: 21 297

I agree that, as opposed to standard binary search, using probability in searching as in my answer improves average-case performance at the cost of worst-case performance. Please note that such a statistical approach has better and better average-case performance as the problem size grows, because the statistical prediction becomes more accurate as there are more samples - assuming that you really know the distribution, of course.

Having a worse worst case is also not a reason to avoid the algorithm. Take the practicality of hashing, for example. – Billiska – 2013-06-02T11:44:17.040

@Billiska - true, but there are cases where you have to consider worst-case performance - unless you're really lucky and all the air molecules happen to leave the room by pure fluke, ironically suffocating the people who're insisting that even an infinitesimal chance can still happen. – Steve314 – 2013-06-02T15:10:58.013

A flat probability distribution sometimes isn't good enough, and the tree model is not necessarily appropriate. – David M W Powers – 2015-11-21T13:52:12.580

@DavidMWPowers - not good enough for what? The only thing I claimed it was good enough for was giving an O(log n) worst-case bound irrespective of how accurate your probability distribution assumption (I probably should have said "prediction") turns out to be. Even Billiska's point about "better and better average-case performance as the problem size grows" is false if an adversary knows what your algorithm is doing, can predict what will trigger worst-case behaviour in advance and deliberately constructs a long series of worst-case queries. Denial-of-service attacks have been based on that. – Steve314 – 2015-11-21T14:36:00.767

@DavidMWPowers - beyond that, my best guess at what you're objecting to is that the probability distribution can change over time (and thus a fixed tree isn't adequate) but the question didn't ask for that. – Steve314 – 2015-11-21T14:41:39.800

@Steve314 - O(log n) average performance isn't always good enough. When n is big, there are applications where that won't do - as mentioned in other comments. O(1) is sometimes desirable and O(loglog n) is sometimes achievable - either in an amortized sense, or by abandoning overlong searches. A tree is unlikely to perform at this level. – David M W Powers – 2015-11-21T15:08:40.947

@David M W Powers - OK, but I don't think OP asked the question you seem to think I should have answered. OP asked about an efficient alternative to binary search for non-uniform distributions, and that's the question I tried to answer. For O(1), hash tables would work - but aren't in general an alternative to binary search because they don't preserve order. – Steve314 – 2015-11-21T16:12:53.147

@David M W Powers - As O(log n) is optimal for a comparison-based search without extra clues, your O(log log n) claim is either similarly not general for what OP was asking, or false. I note that some search structures claim to be O(log log n)-competitive, but only to achieve amortized O(log n) expected performance. I admit being rusty on my theory, so I'm not quite sure how that works. As for abandoning overlong searches - again, obviously not always valid. – Steve314 – 2015-11-21T16:19:54.530

@Steve314 - the distributional information is an extra clue, and we do have an O(loglog n) expected case, but if we are careful we can still retain the O(log n) worst case. Once the distribution is compensated for, we have a uniform distribution for which we use proportional binary search (assume it is scaled so it draws from 1..N into 1..N with replacement). This can be modeled by a Poisson distribution on duplicates and drops, a Skellam distribution on distances, or a Random Walk. In each case σ and μ are o(√N). Halving: log(N/2)=logN-1, Rooting: log(√N)=logN/2. Thus to reduce to 1 is loglog N. – David M W Powers – 2015-11-22T00:28:29.777

@Steve314 - the answer to OP's question is not NO but YES, there is an alternate algorithm. He asked specifically about binary search, and the proportionally weighted binary search does promise better performance, although technically it still has an O(log n) worst case. There are also sorted hash tables which have an O(1) expected case and where proportional probing will achieve expected O(loglog n) resolution in bursty areas. Having a lower load factor actually reduces the constant involved exponentially (Poisson distribution). – David M W Powers – 2015-11-22T00:39:23.760

Let us continue this discussion in chat. – Steve314 – 2015-11-22T04:56:10.113

4

Let me make it precise. What you want for binary search is:

 Given array A which is sorted, but has a non-uniform distribution
 Given left & right indexes L & R of the search range
 Want to search for a value X in A

 To apply binary search, we want to find the index M in [L,R]
 as the next position to look at,

 where the value X should have equal chances of being in either range [L,M-1] or [M+1,R]

In general, you of course want to pick M where you think the value X should be in A, because even if you miss, half of the total 'chance' is eliminated.

So it seems to me you have some expectation about the distribution. If you could tell us what exactly you mean by '1/x distribution', then maybe someone here can build on my suggestion for you.


Let me give a worked example.

I'll use a similar interpretation of '1/x distribution' to @Leonid Volnitsky's.

Here is Python code that generates the input array A:

from random import uniform

# Generating input
a,b = 10,20
A = [ 1.0/uniform(a,b) for i in range(10) ]
A.sort()

# example input (rounded)
# A = [0.0513, 0.0552, 0.0562, 0.0574, 0.0576, 0.0602, 0.0616, 0.0721, 0.0728, 0.0880]

Let's assume the value to search for is:

X = 0.0553

Then the estimated index of X is:

= total number of items * cumulative probability distribution up to X
= length(A) * P(x <= X)

So how do we calculate P(x <= X)? In this case it is simple. We map X back to a value in [a,b], which we will call

X' = 1/X ~ 18

Hence

P(x <= X) = (b-X')/(b-a)
          = (20-18)/(20-10)
          = 2/10

So the expected position of X is:

10*(2/10) = 2

Well, and that's pretty damn accurate!

Repeating the process to predict where X is within each given section of A requires some more work, but I hope this sufficiently illustrates my idea.

I know this might not seem like a binary search anymore if you can get that close to the answer in just one step. But admit it, this is what you can do if you know the distribution of the input array.
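
To make the "some more work" part concrete, here is a hedged sketch of how the same estimate could drive each step of the search, reusing the closed-form P(x <= X) = (b - 1/X)/(b - a) from above; the function names and the per-slice renormalisation are my own choices:

def cdf(x, a=10.0, b=20.0):
    # P(value <= x) for values of the form 1/U, with U uniform on [a, b]
    return min(max((b - 1.0 / x) / (b - a), 0.0), 1.0)

def distribution_guided_search(A, X, a=10.0, b=20.0):
    lo, hi = 0, len(A) - 1
    while lo <= hi:
        p_lo, p_hi = cdf(A[lo], a, b), cdf(A[hi], a, b)
        if p_hi > p_lo:
            # estimated fraction of the way through the current slice
            frac = (cdf(X, a, b) - p_lo) / (p_hi - p_lo)
        else:
            frac = 0.5   # degenerate slice: fall back to the middle
        mid = min(max(lo + int(round(frac * (hi - lo))), lo), hi)
        if A[mid] == X:
            return mid
        if A[mid] < X:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

On the example array above, the very first probe already lands at index 2, roughly matching the hand calculation.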

Billiska

Posted 2013-06-01T12:22:41.700

Reputation: 923

Sorry for editing the answer in quite a different direction; it's not the median thingy anymore. – Billiska – 2013-06-01T12:59:02.660

Actually I liked the version before your edit better. If the data is uniform and the queries are not, you want to choose the median of the query distribution as the pivot. This has to be translated into an array position, though. – krlmlr – 2013-06-01T13:09:38.010

3

The purpose of a binary search is that, for a sorted array, every time you halve the array you are minimizing the worst case, i.e. the worst possible number of checks is log2(entries). If you do some kind of 'uneven' binary search, where you divide the array into a smaller and a larger half, and the element is always in the larger half, you can have worse worst-case behaviour. So I think binary search would still be the best algorithm to use regardless of the expected distribution, just because it has the best worst-case behaviour.

Patashu

Posted 2013-06-01T12:22:41.700

Reputation: 18 146

The thing is, if he knows the distribution of his input array, he can predict where the value being searched for is likely to be in the array. Hence, if he repeatedly searches at that point, the search will terminate much faster. – Billiska – 2013-06-01T13:10:12.997

In some cases the worst case doesn't matter and the average case of logN is too expensive. The proportional expected case of loglogN (if uniform or transformed to uniform) is what we need in many cases (e.g. in language processing or web search). For example in real life, if we can't find the word/document in time, we may indeed miss the opportunity, or we will often substitute a suboptimal (or sometimes even better) alternate. – David M W Powers – 2015-11-21T13:49:57.447

3

You have a vector of entries, say [x1, x2, ..., xN], and you know that the distribution of the queries over this vector is proportional to 1/x. This means your queries will follow that distribution, i.e., on each lookup you're more likely to take an element with a small value, such as x1.

This means a standard binary search tree is balanced with respect to your keys, but takes no account of the query distribution. A possible change would be to relax the constraint of a balanced binary search tree -- smaller to the left of the parent node, greater to the right -- and instead choose each parent node as the most probable remaining element, and its child nodes as the next two most probable elements.

Notice this is not a binary search tree, as you are not dividing your search space in two at every step, but rather a tree rebalanced with respect to your search pattern distribution. This means your worst-case search may reach O(N). For example, with v = [10, 20, 30, 40, 50, 60]:

        30
      /    \
    20      50
   /       /  \
 10       40   60

This can be reordered, or rebalanced, using your function f(x) = 1/x:

f([10, 20, 30, 40, 50, 60]) = [0.100, 0.050, 0.033, 0.025, 0.020, 0.016]
sort(v, f(v)) = [10, 20, 30, 40, 50, 60]

Into a new search tree, that looks like:

        10  -------------> the most probable of being taken
      /    \               leaving v = [[20, 30], [40, 50, 60]]
    20      30  ---------> the most probable of being taken
           /  \            leaving v = [[40, 50], [60]]
          40   50 -------> the most probable of being taken
              /            leaving v = [[60]]
             60

If you search for 10, you only need one comparison, but if you're looking for 60, you'll perform O(N) comparisons, which does not qualify this as a binary search. As pointed out by @Steve314, the farther you move from a fully balanced tree, the worse your worst case of search will be.
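
For what it's worth, here is one possible Python reading of that construction (the most probable remaining value becomes the parent, the next one a leaf child, and the rest hang off the other child); the class and function names are my own, and under those assumptions it reproduces the example tree shape above:

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def probability_ordered_tree(values, f=lambda x: 1.0 / x):
    ordered = sorted(values, key=f, reverse=True)   # descending probability

    def build(items):
        if not items:
            return None
        node = Node(items[0])                  # most probable remaining value
        if len(items) > 1:
            node.left = Node(items[1])         # next most probable, as a leaf
        node.right = build(items[2:])          # everything else
        return node

    return build(ordered)

def comparisons_to_find(node, wanted):
    # plain traversal, not a binary search: worst case O(N)
    steps = 0
    while node is not None:
        steps += 1
        if node.value == wanted:
            return steps
        if node.left is not None and node.left.value == wanted:
            node = node.left
        else:
            node = node.right
    return -1

root = probability_ordered_tree([10, 20, 30, 40, 50, 60])
# comparisons_to_find(root, 10) == 1; the least probable values take many more steps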

Rubens

Posted 2013-06-01T12:22:41.700

Reputation: 10 554

2

I will assume from your description:

  • X is uniformly distributed
  • Y = 1/X is the data you want to search, and it is stored in a sorted table
  • given a value y, you need to binary-search for it in the above table

Binary search usually uses the value in the center of the range (the median). For a uniform distribution it is possible to speed up the search by knowing approximately where in the table we need to look for the searched value.

For example, if we have uniformly distributed values in the [0,1] range and the query is for 0.25, it is best to look not in the center of the range but at the first quarter of the range.

To use the same technique for 1/X data, store in the table not Y but the inverse 1/Y. Search not for y but for the inverse value 1/y.
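
A minimal sketch of that idea, assuming the table already stores the (roughly uniform) inverse values 1/Y in ascending order and we probe proportionally rather than at the midpoint; the names are illustrative:

def search_inverse_table(table, y):
    # table holds 1/Y values, sorted ascending; search for x = 1/y.
    # Exact float equality is only for illustration; real code would
    # search for an insertion point or compare with a tolerance.
    x = 1.0 / y
    lo, hi = 0, len(table) - 1
    while lo <= hi and table[lo] <= x <= table[hi]:
        if table[hi] == table[lo]:
            mid = lo
        else:
            # probe where x sits proportionally within the current range
            mid = lo + int((x - table[lo]) / (table[hi] - table[lo]) * (hi - lo))
        if table[mid] == x:
            return mid
        if table[mid] < x:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1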

Leonid Volnitsky

Posted 2013-06-01T12:22:41.700

Reputation: 5 294

I like your interpretation of '1/x distribution', but I'm still trying to understand the rest. – Billiska – 2013-06-01T13:35:15.700

It doesn't matter whether you have Y or 1/Y in the table, but yes this would give you the appropriate jumps (and might be more efficient due to not taking repeated reciprocals). – David M W Powers – 2015-11-21T13:45:14.237

1

Unweighted binary search isn't even optimal for uniformly distributed keys in expected terms, but it is in worst case terms.

The proportionally weighted binary search (which I have been using for decades) does what you want for uniform data and, by applying an implicit or explicit transform, for other distributions. The sorted hash table is closely related (and I've known about this for decades but never bothered to try it).

In this discussion I will assume that the data is uniformly selected from 1..N and in an array of size N indexed by 1..N. If it has a different distribution, e.g. a Zipfian distribution where the value is proportional to 1/index, you can apply an inverse function to flatten the distribution, or the Fisher Transform will often help (see Wikipedia).

Initially you have 1..N as the bounds, but in fact you may know the actual Min..Max. In any case we will assume we always have a closed interval [Min,Max] for the index range [L..R] we are currently searching, and initially this range is of size O(N). We are looking for key K and want an index I such that

[I-R]/[K-Max] = [L-I]/[Min-K] = [L-R]/[Min-Max], i.e. I = [R-L]/[Max-Min]*[K-Min] + L.

Round so that the smaller partition gets larger rather than smaller (to help worst case). The expected absolute and root mean square error is <√[R-L] (based on a Poisson/Skellam or a Random Walk model - see Wikipedia). The expected number of steps is thus O(loglogN).
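
A hedged Python sketch of that probe rule, using the bracketing array values as Min and Max and rounding so that the smaller partition gets larger (the function name is mine):

import math

def proportional_search(A, K):
    # A sorted ascending; probe proportionally between the bracketing values.
    L, R = 0, len(A) - 1
    while L <= R:
        lo_val, hi_val = A[L], A[R]
        if not (lo_val <= K <= hi_val):
            return -1
        if hi_val == lo_val:
            I = L
        else:
            exact = L + (R - L) * (K - lo_val) / (hi_val - lo_val)
            # round toward the far end so the smaller partition gets larger
            I = math.floor(exact) if exact > (L + R) / 2.0 else math.ceil(exact)
            I = min(max(I, L), R)
        if A[I] == K:
            return I
        if A[I] < K:
            L = I + 1
        else:
            R = I - 1
    return -1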

The worst case can be constrained to be O(logN) in several ways. First, we can decide what constant we regard as acceptable, perhaps requiring steps < c lgN for some c > 1. Proceeding for loglogN steps as above, and then using halving, will achieve this for any such c.

Alternatively, we can modify the standard base b=B=2 of the logarithm so that b>2. Suppose we take b=8; then effectively c~b/B. We can then modify the rounding above so that at step k the largest partition is at most N*b^-k. Viz., keep track of the size expected if we eliminate 1/b from consideration at each step, which leads to a worst case of (b/2) lgN. This will, however, bring our expected case back to O(log N), as we are only allowed to reduce the small partition by 1/b each time. We can restore the O(loglog N) expectation by using simple uprounding of the small partition for loglogN steps before applying the restricted rounding. This is appropriate because within a burst expected to be local to a particular value, the distribution is approximately uniform (that is, for any smooth distribution function, e.g. in this case Skellam, any sufficiently small segment is approximately linear with slope given by its derivative at the centre of the segment).

As for the sorted hash, I thought I read about this in Knuth decades ago, but can't find the reference. The technique involves pushing rather than probing - a (possibly weighted binary) search to find the right place or a gap, then pushing items aside to make room as needed; the hash function must respect the ordering. This pushing can wrap around, so a second pass through the table is needed to pick all items up. It is useful to track Min and Max and their indexes (to get a forward or reverse ordered listing, start at one and track cyclically to the other; they can then also be used instead of 1 and N as the initial brackets for the search above; otherwise 1 and N can be used as surrogates).

If the load factor alpha is close to 1, then insertion is expected O(√N) for expected O(√N) items, which still amortizes to O(1) on average. This cost is expected to decrease exponentially with alpha - I believe (under Poisson assumptions) that μ ~ σ ~ √[Nexp(α)].

The above proportionally weighted binary search can be used to improve on the initial probe.

David M W Powers

Posted 2013-06-01T12:22:41.700

Reputation: 241