Median of a Matrix with sorted rows

10

8

I have not been able to solve the following problem optimally, nor have I found an approach to it anywhere.

Given an N × M matrix in which each row is sorted, find the overall median of the matrix. Assume N*M is odd.

For example,

Matrix =
[1, 3, 5]
[2, 6, 9]
[3, 6, 9]

A = [1, 2, 3, 3, 5, 6, 6, 9, 9]

Median is 5. So, we return 5.
Note: No extra memory is allowed.

Any help will be appreciated.

hatellla

Posted 2017-01-01T09:10:15.827

Reputation: 869

Answers

12

Consider the following process.

  • If we view the N*M matrix as a sorted 1-D array, the median is the (1 + N*M/2)-th element (1-based, using integer division).

  • So x is the median if x is the smallest value for which the number of matrix elements ≤ x is at least 1 + N*M/2. (With duplicates the count can exceed 1 + N*M/2, so requiring equality would be too strict.)

  • Because each row is sorted, you can find the number of elements in a row that are less than or equal to x with a binary search. Doing this for the whole matrix costs O(N log M).

  • First find the minimum and maximum elements of the matrix. Then binary search on that value range, running the above count for each candidate x.

  • The smallest x whose count reaches 1 + N*M/2 is necessarily an element of the matrix, and it is the median.

You can consider the following C++ code:

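#include <algorithm>
#include <vector>
using namespace std;

int median(vector<vector<int> > &A) {
    // The overall minimum is the smallest first element of any row,
    // and the overall maximum is the largest last element of any row.
    int min = A[0][0], max = A[0][0];
    int n = A.size(), m = A[0].size();
    for (int i = 0; i < n; ++i) {
        if (A[i][0] < min) min = A[i][0];
        if (A[i][m-1] > max) max = A[i][m-1];
    }

    // 1-based rank of the median among all n * m elements.
    int element = (n * m + 1) / 2;
    // Binary search for the smallest value with at least `element` entries <= it.
    while (min < max) {
        int mid = min + (max - min) / 2;
        int cnt = 0;
        // Count the elements <= mid, one binary search per row.
        for (int i = 0; i < n; ++i)
            cnt += upper_bound(A[i].begin(), A[i].end(), mid) - A[i].begin();
        if (cnt < element)
            min = mid + 1;
        else
            max = mid;
    }
    return min;
}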

sunkuet02

Posted 2017-01-01T09:10:15.827

Reputation: 1 726

1

A simple O(1) memory solution is to check whether each individual element z is the median. To do this, we find the position of z in every row, accumulating the number of elements smaller than z. This uses the sortedness of the rows only to find the position of z in each row in O(log M) time. Each element needs O(N log M) comparisons, and there are N*M elements, so the total running time is O(N²M log M).
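The check described above could be sketched in Python roughly as follows. This is only an illustration, not code from the answer: the function name is made up, the matrix is assumed to be non-empty with an odd number of elements, and the count of elements equal to z is tracked as well (an addition so that duplicates are handled).

```python
from bisect import bisect_left, bisect_right

def median_by_checking_each_element(matrix):
    """Try every element z as the median; O(1) extra space.

    Hypothetical helper: assumes a non-empty matrix with sorted rows
    and an odd total number of elements.
    """
    total = sum(len(row) for row in matrix)
    target = total // 2  # 0-based position of the median in sorted order
    for row in matrix:
        for z in row:
            # Number of elements strictly smaller than z, one binary search per row.
            smaller = sum(bisect_left(r, z) for r in matrix)
            # Number of elements smaller than or equal to z.
            not_larger = sum(bisect_right(r, z) for r in matrix)
            # z occupies sorted positions smaller .. not_larger - 1.
            if smaller <= target < not_larger:
                return z
    return None

print(median_by_checking_each_element([[1, 3, 5], [2, 6, 9], [3, 6, 9]]))
```

On the matrix from the question this prints 5.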

Saeed Amiri

Posted 2017-01-01T09:10:15.827

Reputation: 19 592

1

If the matrix elements are integers, one can binary search the median starting with the matrix range for hi and low. O(n log m log(hi-low)).

Otherwise, one way that has O(n² log² m) worst-case time complexity is to binary search, O(log m), for each row in turn, O(n), the closest element to the overall matrix median from the left and the closest from the right, O(n log m), updating the best so far. We know the overall median has no more than floor(m * n / 2) elements strictly less than it, and that the number of elements less than it plus the number of times it occurs can be no less than floor(m * n / 2) + 1. We use standard binary search on the row, and – as greybeard pointed out – skip the test for elements outside our 'best' range. The test for how close an element is to the overall median involves counting how many elements in each row are strictly less than it and how many are equal to it, which is achieved in O(n log m) time with n binary searches. Since the row is sorted, we know greater elements would be more "to the right" and lesser elements more "to the left" in relation to the overall median.

If one is permitted to rearrange the matrix, O(mn log (mn)) time complexity is possible by sorting the matrix in place (using block sort, for example) and returning the middle element.

גלעד ברקן

Posted 2017-01-01T09:10:15.827

Reputation: 11 892

1

There is a randomized algorithm that solves this problem in O(n (log n) (log m)) time. It is a Las Vegas algorithm, which means it always gives correct results but might take longer than expected. In this case, the probability that it takes much longer than expected is extremely small.

When m = 1, this problem reduces to the problem of finding the median in a read-only array using constant space. That problem does not have a known optimal solution: see "Finding median in read-only memory on integer input, Chan et al."

One odd thing about this reduction of the problem when m = 1 is that this subcase is also a supercase, in that an algorithm for m = 1 can be applied to the m > 1 case. The idea is to just forget that the array rows are sorted and treat the entire storage area as an unsorted array of size n * m. So, for instance, the trivial algorithm for the m = 1 case, in which each element is checked to see if it is the median, takes O(n²) time. Applying it when m > 1 takes O(n²m²) time.

Going back to the m = 1 case, in the comparison model (in which the items of the array can be integers, strings, real numbers, or anything else that can be compared with the inequality operators "<", ">"), the best known deterministic solution that uses space s (where s is a constant, i.e. in O(1)) has time Θ(2^s s! n^{1 + 1/s}), and it is more complex than the usual algorithms discussed on stackoverflow (though not on https://cstheory.stackexchange.com or https://cs.stackexchange.com). It uses a chained sequence of algorithms A_s, A_{s-1}, ..., A_1, where A_{s+1} calls A_s. You can read it in "Selection from read-only memory and sorting with minimum data movement", by Munro and Raman.

There is a simple randomized algorithm with a smaller run time with high probability. For any constant c, this algorithm runs in time O(n log n) with probability 1 - O(n^{-c}). When the array is the matrix of size n*m that works out to O(n m log (n m)).

This algorithm is very much like quickselect without the rearranging of elements during partitioning.

import random

def index_range(needle, haystack):
  """The 'index range' of a value over an array is a pair
  consisting of the number of elements in the array less
  than that value and the number of elements in the array
  less than or equal to the value.
  """
  less = same = 0
  for x in haystack:
    if x < needle: less += 1
    elif x == needle: same += 1
  return less, less + same

def median(xs):
  """Finds the median of xs using O(1) extra space. Does not
  alter xs.
  """
  if not xs: return None
  # First, find the minimum and maximum of the array and
  # their index ranges:
  lo, hi = min(xs), max(xs)
  lo_begin, lo_end = index_range(lo, xs)
  hi_begin, hi_end = index_range(hi, xs)
  # Gradually we will move the lo and hi index ranges closer
  # to the median.
  mid_idx = len(xs)//2
  while True:
    print("range size", hi_begin - lo_end)
    if lo_begin <= mid_idx < lo_end:
      return lo
    if hi_begin <= mid_idx < hi_end:
      return hi
    assert hi_begin - lo_end > 0
    # Loop over the array, inspecting each item between lo
    # and hi. This loop's sole purpose is to reservoir sample
    # from that set. This makes res a randomly selected
    # element from among those strictly between lo and hi in
    # xs:
    res_size = 0
    res = None
    for x in xs:
      if lo < x < hi:
        res_size += 1
        if 1 == random.randint(1, res_size):
          res = x
    assert res is not None
    assert hi_begin - lo_end == res_size
    # Now find which side of the median res is on and
    # continue the search on the smaller region:
    res_begin, res_end = index_range(res, xs)
    if res_end > mid_idx:
      hi, hi_begin, hi_end = res, res_begin, res_end
    else:
      lo, lo_begin, lo_end = res, res_begin, res_end

It works by maintaining upper and lower bounds on the value of the median. It then loops over the array and randomly selects a value between the bounds. That value replaces one of the bounds and the process starts again.

The bounds are accompanied by their index range, a measure of which indexes the bound would appear at if the array were sorted. Once one of the bounds would appear at the index ⌊n/2⌋, it is the median and the algorithm terminates.
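As a small worked illustration of index ranges, the sketch below computes them for a few values over the question's matrix, flattened in row-major order. The helper name `index_range_of` and the choice of candidates are assumptions made for this example only.

```python
def index_range_of(value, items):
    """Return (count of items < value, count of items <= value).

    If items were sorted, value would occupy exactly the positions
    in the half-open interval [begin, end).
    """
    less = sum(1 for x in items if x < value)
    less_or_equal = sum(1 for x in items if x <= value)
    return less, less_or_equal

# The question's matrix, flattened row by row (not sorted as a whole).
flattened = [1, 3, 5, 2, 6, 9, 3, 6, 9]
mid_idx = len(flattened) // 2  # index of the median in sorted order

for candidate in (3, 5, 6):
    begin, end = index_range_of(candidate, flattened)
    # A candidate is the median exactly when mid_idx falls in its index range.
    print(candidate, (begin, end), begin <= mid_idx < end)
```

Only 5 has an index range, (4, 5), that contains mid_idx = 4; the ranges for 3 and 6 are (2, 4) and (5, 7).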

When an element is randomly selected in the gap between the bounds, this reduces the gap by 50% in expectation. The algorithm terminates (at the latest) when the gap is 0. We can model this as a series of independent random variables X_i, uniformly distributed on (0,1), such that Y_k = X_1 * X_2 * ... * X_k, where X_i is the ratio of the gap that remains after round i. For instance, if after the 10th round the gap between the index ranges of lo and hi is 120, and after the 11th round the gap is 90, then X_11 = 0.75. The algorithm terminates when Y_k < 1/n, because the gap is then less than 1.

Pick a constant positive integer k. Let's bound the probability that Y_{k log2 n} >= 1/n using Chernoff bounds. We have Y_{k log2 n} = X_1 * X_2 * ... * X_{k log2 n}, so ln Y_{k log2 n} = ln X_1 + ln X_2 + ... + ln X_{k log2 n}. The Chernoff bound then gives Pr(ln X_1 + ln X_2 + ... + ln X_{k log2 n} >= ln (1/n)) <= min_{t > 0} e^{-t ln (1/n)} (E[e^{t ln X_1}] * E[e^{t ln X_2}] * ... * E[e^{t ln X_{k log2 n}}]). After some simplification, the right-hand side is min_{t > 0} n^t (E[X_1^t] * E[X_2^t] * ... * E[X_{k log2 n}^t]). Since this is a minimum and we are looking for an upper bound, we can weaken this by specializing to t = 1. It then simplifies to n^{1-k}, since E[X_i] = 1/2.
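Written out, the last simplification looks roughly like this, assuming (as in the model above) that the X_i are independent and uniform on (0,1), so that E[X_i] = 1/2:

```latex
\begin{align*}
\Pr\left(Y_{k\log_2 n} \ge \tfrac{1}{n}\right)
  &\le \min_{t>0}\; n^{t} \prod_{i=1}^{k\log_2 n} E\left[X_i^{t}\right] \\
  &\le n \prod_{i=1}^{k\log_2 n} E[X_i] && \text{(take } t = 1\text{)} \\
  &= n \cdot 2^{-k\log_2 n} \\
  &= n \cdot n^{-k} = n^{1-k}.
\end{align*}
```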

If we pick, for instance, k = 6, then this bounds the probability that there are 6 log2 n rounds or more by n^{-5}. So with probability 1 - O(n^{-5}) the algorithm performs 6 log2 n - 1 or fewer rounds. This is what I mean by "with high probability" above.

Since each round inspects every member of the array a constant number of times, each round takes linear time, for a total running time of O(n log n) with high probability. When the array is not just an array but a matrix of size n * m that works out to O(n m log (n m)).

We can do substantially better, however, by taking advantage of the sortedness of the rows. When we were working in a single unsorted array, finding the elements in the gap I referenced above required inspecting each element of the array. In a matrix with sorted rows, the elements in the gap are located in a contiguous segment of each row. Each segment can be identified in O(log m) time using binary search, so they can all be located in O(n log m) time. The reservoir sampling now takes O(n log m) time per iteration of the loop.

The other main work done in the loop is to identify the index range of the element from the gap that was randomly selected. Again, because each row is sorted, the index range for the randomly-chosen element in a row can be determined in O(log m) time. The sums of the index ranges for each row constitute the index range over the whole array, so this part of each loop iteration also takes only O(n log m) time.

By the same argument as above with the Chernoff bound, there are O(log n) iterations with probability at least 1 - O(n^{-k}) for any constant k. Thus the whole algorithm takes O(n (log n) (log m)) time with high probability.

import bisect
import random

def matrix_index_range(needle, haystack):
  """matrix_index_range calculates the index range of needle
  in a haystack that is a matrix (stored in row-major order)
  in which each row is sorted"""
  n, m = len(haystack), len(haystack[0])
  begin = end = 0
  for x in haystack:
    begin += bisect.bisect_left(x, needle)
    end += bisect.bisect_right(x, needle)
  return begin, end

def matrix_median(xs):
  print("Starting")
  if not xs or not xs[0]: return None
  n, m = len(xs), len(xs[0])
  lo, hi = xs[0][0], xs[0][m-1]
  for x in xs:
    lo, hi = min(lo, x[0]), max(hi, x[m-1])
  lo_begin, lo_end = matrix_index_range(lo, xs)
  hi_begin, hi_end = matrix_index_range(hi, xs)
  mid_idx = (n * m) // 2
  while True:
    print("range size", hi_begin - lo_end)
    if lo_begin <= mid_idx < lo_end:
      return lo
    if hi_begin <= mid_idx < hi_end:
      return hi
    assert hi_begin - lo_end > 0
    mid = None
    midth = random.randint(0, hi_begin - lo_end - 1)
    for x in xs:
      gap_begin = bisect.bisect_right(x, lo)
      gap_end = bisect.bisect_left(x, hi)
      gap_size = gap_end - gap_begin
      if midth < gap_size:
        mid = x[gap_begin + midth]
        break
      midth -= gap_size
    assert mid is not None
    mid_begin, mid_end = matrix_index_range(mid, xs)
    assert lo_end <= mid_begin and mid_end <= hi_begin
    if mid_end > mid_idx:
      hi, hi_begin, hi_end = mid, mid_begin, mid_end
    else:
      lo, lo_begin, lo_end = mid, mid_begin, mid_end

This solution is substantially faster than the first one when m is non-constant.

jbapple

Posted 2017-01-01T09:10:15.827

Reputation: 2 651

1

I have coded the O(n² log² m) time solution of גלעד ברקן, but they have asked me to not add the code to their answer, so here it is as a separate answer:

import bisect

def MedianDistance(key, matrix):
  lo = hi = 0
  for row in matrix:
    lo += bisect.bisect_left(row, key)
    hi += bisect.bisect_right(row, key)
  mid = len(matrix) * len(matrix[0]) // 2
  if hi - 1 < mid: return hi - 1 - mid
  if lo > mid: return lo - mid
  return 0

def ZeroInSorted(row, measure):
  lo, hi = -1, len(row)
  while hi - lo > 1:
    mid = (lo + hi) // 2
    ans = measure(row[mid])
    if ans < 0: lo = mid
    elif ans == 0: return mid
    else: hi = mid

def MatrixMedian(matrix):
  measure = lambda x: MedianDistance(x, matrix)
  for idx, row in enumerate(matrix):
    if not idx & idx-1: print(idx)
    ans = ZeroInSorted(row, measure)
    if ans is not None: return row[ans]

jbapple

Posted 2017-01-01T09:10:15.827

Reputation: 2 651

0

sunkuet02's answer with refinements and Python code:
Each row of the N×M matrix A is sorted and has a middle element, which is its median.
There are at least N*(M+1)/2 elements no larger than the maximum hi of these medians, and at least N*(M+1)/2 no smaller than the minimum lo:
the median of all elements of A must be between lo and hi, inclusive.
As soon as more than half the elements are known to be lower than the current candidate, the latter is known to be high. As soon as there are too few rows remaining for the count of elements lower than the current candidate to reach half the total, the candidate is known to be low: in both cases, immediately proceed to the next candidate.

from bisect import bisect

def median(A):
    """ returns the median of all elements in A.
        Each row of A needs to be in ascending order. """
    # overall median is between min and max row median
    lo, hi = minimax(A)
    n = len(A)
    middle_row = n // 2
    columns = len(A[0])
    half = (n * columns + 1) // 2
    while lo < hi:
        mid = lo + (hi - lo) // 2
        lower = 0
        # first half can't decide median
        for a in A[:middle_row]:
            lower += bisect(a, mid)
        # break as soon as mid is known to be too high or low
        for r, a in enumerate(A[middle_row:n-1]):
            lower += bisect(a, mid)
            if half <= lower:
                hi = mid
                break
            if lower < r*columns:
                lo = mid + 1
                break
        else: # decision in last row
            lower += bisect(A[n-1], mid)
            if half <= lower:
                hi = mid
            else:
                lo = mid + 1

    return lo


def minmax(x, y):
    """return min(x, y), max(x, y)"""
    if x < y:
        return x, y
    return y, x


def minimax(A):
    """ return min(A[0..n-1][m//2]), max(A[0..n-1][m//2]):
        minimum and maximum of the row medians if A is a
        row major matrix with n sorted rows of m columns."""
    n = len(A)
    half = len(A[0]) // 2  # column index of each row's median
    if n % 2:
        lo = hi = A[0][half]
    else:
        lo, hi = minmax(A[0][half], A[1][half])
    # process the remaining rows in pairs
    for i in range(2 - n % 2, n, 2):
        l, h = minmax(A[i][half], A[i+1][half])
        if l < lo:
            lo = l
        if hi < h:
            hi = h
    return lo, hi


if __name__ =='__main__':
    print(median( [[1, 3, 5], [2, 6, 9], [3, 6, 9]] ))

(I consider std::upper_bound() and bisect.bisect() to be equivalent; bisect.bisect() is an alias for bisect.bisect_right().)
For the second candidate median, the last row processed may be lower than in the first iteration. In subsequent iterations, that row number should never decrease. I was too lazy to factor that in (rename middle_row and increase it as appropriate).

greybeard

Posted 2017-01-01T09:10:15.827

Reputation: 1 261