Estimating the maximum element of a large list

Description of the problem

Assume we have a large unordered list of integers with repetitions. We would like to pick the largest integer in this list without going through the whole list. How many elements $M$ do we have to go through so that the largest of the first $M$ elements is the largest of the list with probability $p$?

A solution

Assume we listed $D$ distinct integers within the first $M$ terms. Assuming all terms appear equally likely within the list, the probability that maximum value not showing up in the first $M$ entries is \[ \left(1 - \frac{1}{D+1}\right)^M \] We want this number be equal to $1-p$. Thus \[ M = \left\lceil\frac{\log(1-p)}{\log\left(\frac{D}{D+1}\right)}\right\rceil \] For example, for a large list of (estimated) 100 distinct integers, in order to guess the largest item with 99% accuracy we need to read 459 items.

There is nothing specific about the maximum: one can replace it with any reducer function acting on the set of distinct elements obtained from the first $M$ elements in our list.

An implementation in Common Lisp

Here is an implementation of probabilistic calculation of the reducer function applied to the correct beginning portion of a large list.

(defun calculate (xs fn p &optional (n 1) (buffer (list (car xs))))
   (let* ((d (length buffer))
          (q (- 1.0 (expt (/ d (+ 1.0 d)) n))))
     (if (or (null xs) (> q p))
        (list :result (reduce fn buffer) :distinct d :checked n :probability q)
        (calculate (cdr xs) fn p (1+ n) (union buffer (list (car xs)))))))

CALCULATE

The list of distinct items is kept in buffer. The number n indicates how many items read from the list so far.

Let us test this on an example list: below we calculate the maximum of a large list of integers.

(calculate (loop repeat 100000 collect (random 100)) #'max 0.99)

(RESULT 99 DISTINCT 99 CHECKED 459 PROBABILITY 0.990079)

The code below finds the average of the list of integers which fluctuate around $10$ with an error $\pm 2$:

(let* ((xs (loop repeat 10000 collect (+ 10 (- (random 5) 2))))
       (n (length xs))
       (res (calculate xs #'+ 0.999))
       (ave (/ (getf res :result) (getf res :distinct))))
   (append (list :average ave)  (cddr res)))

(AVERAGE 10 DISTINCT 5 CHECKED 38 PROBABILITY 0.9990203)

Here is a worst case scenario: the average of a list of floats which fluctuate around $1.0$ with an error $\pm 0.5$:

(let* ((xs (loop repeat 10000 collect (+ 1.0 (- (random 1.0) 0.5))))
       (n (length xs))
       (res (calculate xs #'+ 0.999))
       (ave (/ (getf res :result) (getf res :distinct))))
   (append (list :average ave)  (cddr res)))

(AVERAGE 0.9995118 DISTINCT 9995 CHECKED 10001 PROBABILITY 0.6322368)

The number of distinct items in the list is close to the number of terms in the list. We went through the whole list. Even then the probability that we get the right thing is not satisfactory. [Can you guess why? Think of all floats with this property. How many are there?]

An imperative version

I am fond of recursion. IMHO Recursive functions do explain most computations in a very succint way. But, you don’t have to buy into this. Here is a more imperative version of the implementation above in case recursion doesn’t do it for you

(defun calculate (xs fn p)
   (do* ((n 0 (1+ n))
         (ys xs (cdr ys))
         (buffer nil (union buffer (list (car ys))))
         (d 0 (length buffer))
         (q 0.0 (- 1.0 (expt (/ d (+ 1.0 d)) n))))
        ((or (null ys) (> q p))
         (list :result (reduce fn buffer)
               :distinct d
               :checked n
               :probability q))))

CALCULATE

An implementation in clojure

And finally, here is an implementation in clojure in case Common Lisp doesn’t do it for you:

(defn calculate [xs f p]
   (loop [ys xs
          buffer #{}
          n 0
          q 0.0]
     (if (or (empty? ys) (> q p))
        {:result (reduce f buffer)
         :distinct (count buffer)
         :checked n
         :probability q}
        (recur (rest ys)
               (clojure.set/union buffer #{(first ys)})
               (inc n)
               (- 1.0 (Math/pow (- 1.0 (/ (+ 1.0 (count buffer)))) n))))))