## efficiently find amount of integers in a sorted array

10

1

I'm studying for a test, and found this question.

You are given a sorted array of integers for example:

``````{-5, -5, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 67, 67, 99}
``````

Write a method:

``````Public static int count (int[] a, int x)
``````

that returns the amount of times, number 'x' is in the array.

for example:

``````x = -5, it returns 2
x = 2, it returns 5
x = 8, it returns 0
``````

I need to write it as efficient as possible, please don't give me the answer (or write it if you wish, but I won't look), my idea is to do a binary search, and then go both edges (backward and forward) of the value that I find and with the index numbers, return the correct answer, my questions are:

1. is it the most efficient way?
2. won't it be O(n) in the worst case? (when the array is filled with a single number) -

If so - then why should I do a binary search?

8@ambigram_maker it would probably be a "modified" binary search where if you find your key, you still keep searching. – C.B. – 2014-04-02T14:33:00.077

If you are allowed to do so, check whether the number is present in the array. If not, return. Otherwise, I'd say use premature return to exit the method as soon as possible. – blueygh2 – 2014-04-02T14:33:09.707

there must be a way below O(n), it's a question about complexity, they won't give me a question that I can just go O(n) times and count the number when I find it, it's a twist of binary search in my opinion. – Seth Keno – 2014-04-02T14:33:29.877

2@SethKeno obviously your algorithm has to look at every element in the array. I don't think it gets better than O(n). The naive approach i can think of would rather be O(n^2) [for each element, look at it, then go through the array, counting that element] – fstd – 2014-04-02T14:34:27.893

@SethKeno, for a test the trivial solution is `O(n)` and your solution is also `O(n)` in the worst case. – Anthony Accioly – 2014-04-02T14:36:06.667

@fstd what about if I do a binary search, find the number, go backwards until I find a different number, and go forward until I find a different number (or array ends each direction), will it be less than O(n) or not? – Seth Keno – 2014-04-02T14:36:19.727

@user432 and how do you find where the 'next' element is, without going over it linearly? – fstd – 2014-04-02T14:36:35.423

@SethKeno I'm not sure what you have in mind, but note that the big-O notation specifies an upper bound of the complexity. Your algoritm, even if it on average might outperform the linear solution, sounds still like O(n) to me – fstd – 2014-04-02T14:38:18.590

@fstd why is the naive approach O(n^2)? it is sorted – Seth Keno – 2014-04-02T14:38:21.227

@SethKeno the naive approach would have a loop going over the array, and for each (not yet seen) element, it would again go over the array (i.e. two nested loops of complexity O(n) -- that's O(n^2). EDIT: admittedly it wouldn't have to go over the entire array all the time, but it still qualifies as O(n^2) (beause it's an upper bound, as said) – fstd – 2014-04-02T14:40:10.793

2@fstd Have you read the question? Even the most naive implementation (counting the # of occurences of ONE value in an array is no more than O(n)! And given that the array is sorted, this can certainly be solved in O(log(n)) using some divide and conquer approach. – Gyro Gearless – 2014-04-02T15:03:49.023

@GyroGearless I originally misread the question, as is evident in the comments of a different answer. Forgot to update this thread. (I originally read the question as if the goal was to get the frequency of /all/ distinct members in the array) – fstd – 2014-04-03T15:04:55.147

possible duplicate of Using Binary Search with sorted Array with duplicates

– nawfal – 2014-06-15T20:08:28.343

15

Modify your binary search to find the first and last occurrence of the given input, then the difference between those two indexes is the result.

To find a first and last occurrence using binary search you need to change a bit from the usual binary search algorithm. In binary search the value is returned when a match is found. But here unlike the usual binary search you need to continue searching until you find a mismatch.

A bit update

after you find the first occurrence then you can use that index as a starting point of the next binary search to find the last.

1You really should elaborate on how to do each of these binary searches in the answer itself. – Dukeling – 2014-04-02T14:47:37.050

hi @Dukeling .. those are already answered in other threads.. so I thought it would be enough to point those links – stinepike – 2014-04-02T14:48:24.540

Clever :). This is actually `O(log n)`. Anyway, considering branch predictions, I'm interested to know how large must `int[] a` be so that this approach starts beating the naive `O(n)` approach. I assume that in practice the `O(n)` method is still faster for small arrays. – Anthony Accioly – 2014-04-02T14:48:40.417

1Consider that this answer requires that you read at least 3 answers on different pages to see how to do this. Then consider that you can probably summarize both approaches in about 2 sentences in this answer (and provide the links for additional reference). To provide the maximum long-term value, it seems like a no-brainer to do the latter. – Dukeling – 2014-04-02T14:55:36.940

thanks @Dukeling for the advice , i am editing the answer a bit – stinepike – 2014-04-02T14:58:13.930

2@StinePike Two answers within a gap of 30 seconds. One gets 7 upvotes while the other is on 1. – Nikunj Banka – 2014-04-02T15:04:51.843

1@NikunjBanka SO is notoriously time-sensitive. Plus, SO crowds don't like long answers. Shorter answer will usually attract more upvotes, just because it can be skimmed through faster (that is my theory anyway). If we could specify code window's height, and you'd specify it like, say, 5 lines, I'm sure your answer would get more upvotes. :) – Will Ness – 2014-04-02T15:10:46.433

dont know man .. may be just luck :S .... @NikunjBanka – stinepike – 2014-04-02T15:12:04.367

@AnthonyAccioly - It would also be interesting to compare this against binary search to find the first occurrence, then linear search from there to count the number of occurrences. – mbeckish – 2014-04-02T20:22:16.513

@mbeckish ... yes it can be done .. but what if there is only single number ocurring in all elements ? .... in second case it will use O(n) – stinepike – 2014-04-03T02:15:54.873

1@StinePike - Yes, that would be the worst case. I would consider your solution to be best for large data sets with no a priori knowledge of the nature of the data. – mbeckish – 2014-04-03T18:59:17.493

5

Do binary Search to find the first occurence. Do binary search to find the last occurence. The number of occurences is equal to the number of numbers between the 2 indices found.

Working code:

``````public class Main {
public static void main(String[] args){
int[] arr = {-5, -5, 1, 1, 1, 1, 1, 1,
1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 67, 67, 99};
int lo = getFirst(arr, -5);
if(lo==arr.length){ // the number is not present in the array.
System.out.println(0);
}else{
int hi = getLast(arr, -5);
System.out.println((hi-lo+1));
}
}

// Returns last occurence of num or arr.length if it does not exists in arr.
static int getLast(int[] arr, int num){
int lo = 0, hi = arr.length-1, ans = arr.length;
while(lo<=hi){
int mid = (lo+hi)/2;
if(arr[mid]==num){
ans = mid;
lo = mid+1;
}else if(arr[mid]<num){
lo = mid+1;
}else if(arr[mid]>num){
hi = mid-1;
}
}
return ans;
}

// Returns first occurence of num or arr.length if it does not exists in arr.
static int getFirst(int[] arr, int num){
int lo = 0, hi = arr.length-1, ans = arr.length;
while(lo<=hi){
int mid = (lo+hi)/2;
if(arr[mid]==num){
ans = mid;
hi = mid-1;
}else if(arr[mid]<num){
lo = mid+1;
}else if(arr[mid]>num){
hi = mid-1;
}
}
return ans;
}
}
``````

Minor correction. You would need it `ans = -1;` to start with. – nawfal – 2014-06-15T20:27:30.193

4

Two solutions come to mind:

1) Do Binary Search alright, but maintain that invariant that it finds the first occurence. Then do a linear search. This will be Theta(log n + C) where C is the count.

Programming Pearls by Jon Bentley has a nice write up, where he mentions looking for the first occurence is actually more efficient than looking for any occurence.

2) You can also do two binary searches, one for the first occurence, and one for the last, and take the difference of the indices. This will be Theta(log n).

You can decide which solution to apply based on the expected value of C. If C = o(log n) (yes small o), then looking for the first occurence and doing linear search will be better. Otherwise do two binary searches.

If you don't know the expected value of C, then you might be better off with solution 2.

1

Actually there is a slightly better solution than the given solutions! It is a combination of two different ways to do binary search.

First you do a binary search to get the first occurence. This is O(log n)

Now, starting with the first index you just found, you do a different kind of binary search: You guess the frequency of that element F, by starting with a guess of F=1 and doubling your estimate and check if the element repeats. This is guaranteed to be O(log F) (where F is the frequency).

This way, the algorithm runs in O(log N + log F)

You do not have to worry about the distribution of the numbers!

0

IMHO this is the most efficient solution: Others may have mentioned a similiar approach, but I think this one is the easiest to explain and the easiest to understand, also it has a modification that will speed up the process in practice:

Basically the idea is the find the smallest and largest index of the occurence. Finding the smallest is O(log N) using binary search (using Newton's method to actually increase the performance in the average case is a possible improvement). If you don't know how to modify binary search to find the smallest index, the trivial modification is to look for the element with value `(p - 0.5)` - obviously you will not find that value in the integer array, but if the binary search terminates the index will be the one next to where the recursion stops. You will just need to check whether it exists at all. This will give you the smallest index.

Now in order to find the largest index, again you will have to start a binary search, this time using the smallest index as the lower bound and `(p + 0.5)` as the search target, this is guaranteed to be O(log N), in the average case it will be O(log N/2). Using Newton's method and taking the values of the upper and lower bounds into account will in practice improve the performance.

Once you have found the largest and smallest index, the difference between them obviously is the result.

For evenly distributed numbers, using the Newton modification will drastically improve the runtime (in the case of a continuous equidistant sequence of numbers it will be O(1) (two or three steps) to find the smallest and greatest values), although the theoretical complexity still is O(log N) for arbitrary input.