Imagine you have a stream of integers which satisfies:
The number of distinct items the stream hands to you is finite.
Each distinct item appears in the stream with equal probability.
We want to estimate the number of distinct items appearing in the stream. The problem is this: if the number of items is large, can we make a statistical guess about the cardinality without storing all, or even some, of the stream?
There are very nice and efficient algorithms out there, such as LogLog and HyperLogLog. Nick Johnson wrote an extensive blog entry on them, which I recommend reading; he also has Python implementations of these algorithms. But those algorithms require hashing the inputs and examining initial segments of the generated hash values.
Today, I am going to describe a much simpler single-pass algorithm whose memory complexity is constant.
Assume \((a_n)_{n\in\mathbb{N}}\) is a very long sequence of integers, and assume you have seen an item \(x\) in the stream. What is the expected number of items you are going to see after \(x\) until you see \(x\) again? It is a simple statistics question: we have \(m\) distinct items and each item appears in the stream with probability \(\frac{1}{m}\), so the waiting time until we see \(x\) again is geometrically distributed. Thus the number we need is given by an infinite series \[ \sum_{k=1}^\infty \left(1-\frac{1}{m}\right)^{k-1}\frac{k}{m} = \frac{1}{m}\left.\frac{d}{dx}\left(\frac{1}{1-x}\right)\right|_{x=1-\frac{1}{m}} = \frac{1}{m}\left.\frac{1}{(1-x)^2}\right|_{x=1-\frac{1}{m}} = m \] This means that if we get a good statistical estimate for this number, then we can estimate the number of distinct items in our stream.
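Before building the estimator, here is a quick empirical check of that expectation. This is only a minimal sketch assuming a uniform stream produced by random; the helper average-gap is for illustration and is not used in what follows. For m distinct values the measured average gap should come out close to m.

(defun average-gap (m trials)
  ;; pick a value x uniformly from 0..m-1, count how many further uniform
  ;; pulls it takes until x shows up again, and average over many trials
  (/ (loop repeat trials
           sum (let ((x (random m)))
                 (loop for k from 1
                       until (= (random m) x)
                       finally (return k))))
     trials
     1.0d0))

;; (average-gap 100 100000) should return a value close to 100.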
For the purposes of fixing the terminology, an interval is a contiguous subsequence of our stream which starts and ends with the same item \(x\), with \(x\) not appearing anywhere in the middle. Its length is the number of items following the initial \(x\), up to and including the repeated \(x\); for example, in the stream 7 3 5 7 ... the items 7 3 5 7 form an interval of length 3. We want to find the average length of such intervals.
Without further ado, here is the code:
(defun estimate (stream m &optional x j n c)
  (let ((y (funcall stream)))
    (cond
      ;; first call: initialize the state with the first item pulled
      ((null x) (estimate stream m y 1 1 0))
      ;; out of samples: return the rounded average interval length
      ((= m 0) (floor (+ 0.5d0 (/ c n))))
      ;; interval closed: record its length and start a fresh one
      ((= x y) (estimate stream (1- m) (funcall stream) 1 (1+ n) (+ c j)))
      ;; still inside the current interval: just extend it
      (t (estimate stream (1- m) x (1+ j) n c)))))
ESTIMATE
The function returns the estimate for the number I described in the previous section. Do not worry: we are not going to blow the stack because of the deeply recursive nature of the implementation, provided the Lisp compiler does proper tail call optimization, as it is supposed to do.
Let me briefly explain what the variables are:
stream: represents our stream, implemented as a function which accepts no input and returns the next item in the stream.
m: the number of items we are going to pull from the stream to calculate our estimate.
x: the item at the beginning of the current interval. It will change when we move on to the next interval.
j: the number of items we have seen so far in the current interval.
n: the number of intervals we have seen up to that point.
c: the total length of all intervals we have seen up to that point.
Every time we pull a new element y from the stream, we compare it with x. Depending on whether the two are equal, we make the appropriate adjustments and call our function recursively. We return the average interval length after we have pulled m items from the stream.
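As a quick sanity check, here is a minimal usage sketch: the stream is just a closure over random with 100 distinct values, and the sample size of 50000 is an arbitrary choice following the 500-per-distinct-item rule I use in the experiments below.

(estimate (lambda () (random 100)) 50000)

The call should return an integer close to 100, although the exact value will vary from run to run.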
Let us put this to the test. For that, I need an experiment function:
(defun experiment (n num)
  ;; run NUM estimates over a uniform stream with N distinct values,
  ;; pulling 500*N items each time; return the mean of the estimates,
  ;; their standard deviation, and the relative error
  (let* ((m (* n 500))
         (col (loop repeat num collect
                    (estimate (lambda () (random n)) m)))
         (ave (/ (reduce #'+ col) num 1.0d0))
         (dev (sqrt (/ (reduce #'+ (mapcar (lambda (x) (expt (- x ave) 2)) col)) num))))
    (list ave dev (/ dev n))))
EXPERIMENT
And now, the actual test:
(experiment 100 25)
(100.24 5.14027236632457 0.0514027236632457)
(experiment 1000 25)
(996.36 44.59226838814101 0.04459226838814101)
(experiment 10000 25)
(9995.04 497.4946415791833 0.04974946415791833)
Pay attention to the last number: the relative error. I played with the number of items to pull from the stream, and the optimal choice for \(N\) seems to be about 500 times \(m\) (the number of distinct items) to keep the relative error around 5%. My numerical experiments indicate that the relative error scales on the order of \(m/N\), where \(m\) is the number of distinct items and \(N\) is the size of the sample.
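If we take that rough scaling at face value, it can be turned into a rule of thumb for choosing the sample size. The helper below is only a hypothetical sketch, not part of the algorithm itself: the constant 25 is read off from the experiments above (a relative error of roughly 5% at \(N = 500m\)), and m-guess stands for an assumed upper bound on the number of distinct items.

(defun sample-size (m-guess target-error)
  ;; hypothetical rule of thumb: relative error is roughly 25 * m / N,
  ;; so solve for the sample size N given a target relative error
  (ceiling (* 25 m-guess) target-error))

;; (sample-size 100 0.05) suggests pulling about 50000 items,
;; i.e. 500 times the guessed cardinality, matching the experiments above.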
For those who would prefer a more imperative version of the implementation, here is one:
(defun estimate (stream m)
  (let ((c 0)                  ; total length of completed intervals
        (n 1)                  ; number of intervals
        (j 1)                  ; length of the current interval so far
        (x (funcall stream))   ; item that opened the current interval
        (y (funcall stream)))  ; most recently pulled item
    (dotimes (i m (floor (+ 0.5d0 (/ c n))))
      (when (= x y)
        ;; interval closed: record its length and open a fresh one
        (incf c j)
        (incf n 1)
        (setf j 0)
        (setf x (funcall stream)))
      (incf j)
      (setf y (funcall stream)))))
ESTIMATE