Assume we have a large unordered list of integers with repetitions. We would like to pick the largest integer in this list without going through the whole list. How many elements \(M\) do we have to go through so that the largest of the first \(M\) elements is the largest of the list with probability \(p\)?
Assume we listed \(D\) distinct integers within the first \(M\) terms. Assuming all terms appear equally likely within the list, the probability that maximum value not showing up in the first \(M\) entries is \[ \left(1 - \frac{1}{D+1}\right)^M \] We want this number be equal to \(1-p\). Thus \[ M = \left\lceil\frac{\log(1-p)}{\log\left(\frac{D}{D+1}\right)}\right\rceil \] For example, for a large list of (estimated) 100 distinct integers, in order to guess the largest item with 99% accuracy we need to read 459 items.
There is nothing specific about the maximum: one can replace it with any reducer function acting on the set of distinct elements obtained from the first \(M\) elements in our list.
Here is an implementation of probabilistic calculation of the reducer function applied to the correct beginning portion of a large list.
defun calculate (xs fn p &optional (n 1) (buffer (list (car xs))))
(let* ((d (length buffer))
(- 1.0 (expt (/ d (+ 1.0 d)) n))))
(q (if (or (null xs) (> q p))
(list :result (reduce fn buffer) :distinct d :checked n :probability q)
(cdr xs) fn p (1+ n) (union buffer (list (car xs))))))) (calculate (
CALCULATE
The list of distinct items is kept in buffer
. The number
n
indicates how many items read from the list so far.
Let us test this on an example list: below we calculate the maximum of a large list of integers.
loop repeat 100000 collect (random 100)) #'max 0.99) (calculate (
99 DISTINCT 99 CHECKED 459 PROBABILITY 0.990079) (RESULT
The code below finds the average of the list of integers which fluctuate around \(10\) with an error \(\pm 2\):
let* ((xs (loop repeat 10000 collect (+ 10 (- (random 5) 2))))
(length xs))
(n (#'+ 0.999))
(res (calculate xs / (getf res :result) (getf res :distinct))))
(ave (append (list :average ave) (cddr res))) (
10 DISTINCT 5 CHECKED 38 PROBABILITY 0.9990203) (AVERAGE
Here is a worst case scenario: the average of a list of floats which fluctuate around \(1.0\) with an error \(\pm 0.5\):
let* ((xs (loop repeat 10000 collect (+ 1.0 (- (random 1.0) 0.5))))
(length xs))
(n (#'+ 0.999))
(res (calculate xs / (getf res :result) (getf res :distinct))))
(ave (append (list :average ave) (cddr res))) (
0.9995118 DISTINCT 9995 CHECKED 10001 PROBABILITY 0.6322368) (AVERAGE
The number of distinct items in the list is close to the number of terms in the list. We went through the whole list. Even then the probability that we get the right thing is not satisfactory. [Can you guess why? Think of all floats with this property. How many are there?]
I am fond of recursion. IMHO Recursive functions do explain most computations in a very succint way. But, you don’t have to buy into this. Here is a more imperative version of the implementation above in case recursion doesn’t do it for you
defun calculate (xs fn p)
(do* ((n 0 (1+ n))
(cdr ys))
(ys xs (nil (union buffer (list (car ys))))
(buffer 0 (length buffer))
(d 0.0 (- 1.0 (expt (/ d (+ 1.0 d)) n))))
(q or (null ys) (> q p))
((list :result (reduce fn buffer)
(
:distinct d
:checked n :probability q))))
CALCULATE
And finally, here is an implementation in clojure in case Common Lisp doesn’t do it for you:
defn calculate [xs f p]
(loop [ys xs
(
buffer #{}0
n 0.0]
q if (or (empty? ys) (> q p))
(:result (reduce f buffer)
{:distinct (count buffer)
:checked n
:probability q}
recur (rest ys)
(first ys)})
(clojure.set/union buffer #{(inc n)
(- 1.0 (Math/pow (- 1.0 (/ (+ 1.0 (count buffer)))) n)))))) (