The main idea of this post comes from a post on Daniel Lemire’s blog.
Assume that we have a stream of distinct objects of some unknown length, and that we are supposed to take a sample of \(k\) objects from this stream. We can use the Reservoir Sampling algorithm to perform this task efficiently.
Function Sample
Input:  k, the number of samples
        S, the stream of objects of unknown length
Output: A list of objects from S of length k
Initialize: Let res be the first k terms of S
Begin
    i <- k+1
    While S is not empty
        Read an element x from S
        Let j be a random integer between 1 and i
        If j <= k then
            res[j] <- x
        End
        i <- i+1
    End
    Return res
End
It is easy to see that the algorithm finishes in one pass, and that the work done in each iteration is constant, independent of the length of the stream. So, the algorithm runs in \(\mathcal{O}(n)\) time where \(n\) is the length of the stream, and it uses only \(\mathcal{O}(k)\) extra memory for the reservoir.
I lifted the argument for the correctness of the algorithm I present here (with some pedantic changes) from here.
I will use induction to show correctness; the invariant is that after \(m\) elements have been read, each of them is in res with probability \(k/m\). The base case is when \(n\), the length of the stream, is equal to \(k\), the number of elements we are supposed to pick. The algorithm clearly states that we pick every element from the stream, so the probability of picking each \(x\in S\) is \(k/k = 1\). Assume, as our induction hypothesis, that at the \(m\)-th step the probability of each \(x\) from the stream so far being in res is \(k/m\). Let \(y\) be the element of \(S\) provided to us at the \((m+1)\)-st step. Our algorithm dictates that the probability of picking \(y\) is \(k/(m+1)\), and, given that \(y\) is picked, the probability that it replaces the element at the \(i\)-th position in res is \(1/k\). Thus the probability that the element at the \(i\)-th position in res is replaced with \(y\) is \(1/(m+1)\), and the probability that each of the remaining elements in res survives to the next round is \(m/(m+1)\). The probability that we picked such an element during rounds 1 to \(m\) was \(k/m\). Thus, at the next round the probabilities become \((k/m)(m/(m+1)) = k/(m+1)\).
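To make the inductive step concrete, take \(k = 2\) and \(m = 3\): the fourth element is picked with probability \(2/4\), and it lands in a particular slot of res with probability \((2/4)(1/2) = 1/4\); an element already in res therefore survives with probability \(3/4\), so its overall probability of being in the sample goes from \(2/3\) to \((2/3)(3/4) = 2/4\), exactly \(k/(m+1)\).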
As one can see, this is a streaming algorithm: if the stream is cut off at any point, the sample we hold at that moment is a uniform sample of the elements seen so far. Each iteration preserves the uniformity of the sample even though the population it is drawn from grows by one at every turn.
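For instance, here is a minimal sketch of that idea (my own addition, not from the post I cite) that samples \(k\) lines from a character stream; whenever the stream runs out, or we simply stop reading, the reservoir holds a uniform sample of the lines seen so far:

(defun sample-lines (k in)
  ;; A sketch: reservoir-sample k lines from the character stream in.
  ;; If the stream has fewer than k lines, the reservoir is padded with NIL.
  (let ((res (make-array k :initial-element nil))
        (i 0))
    (loop for line = (read-line in nil)   ; nil => return NIL at end of stream
          while line
          do (if (< i k)
                 (setf (aref res i) line)        ; fill the reservoir first
                 (let ((j (random (1+ i))))      ; j uniform over 0..i
                   (when (< j k)
                     (setf (aref res j) line)))) ; keep line with probability k/(i+1)
             (incf i))
    (coerce res 'list)))

For example, (with-open-file (in "lines.txt") (sample-lines 5 in)) picks 5 lines uniformly in one pass, where lines.txt is a stand-in name for any text file.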
The implementation for plain lists is straightforward:
(defun sample (k s)
  (let ((res (make-array k))
        (i 0))
    (dolist (x s)
      (if (< i k)
          (setf (aref res i) x)          ; fill the reservoir with the first k elements
          (let ((j (random (1+ i))))     ; x is the (i+1)-st element; j uniform over 0..i
            (when (< j k)
              (setf (aref res j) x))))   ; keep x with probability k/(i+1)
      (incf i))
    (coerce res 'list)))
Let me test it:
(loop repeat 20 do
  (let* ((x (1+ (random 9)))                      ; k, between 1 and 9
         (y (+ x (random (- 200 x))))             ; stream length, between k and 199
         (z (loop for i from 1 to y collect i)))  ; the stream: 1, 2, ..., y
    (format t "~% ~1D ~3D: ~A" x y (sample x z))))
7 128: (37 63 108 116 109 126 42)
4 155: (67 82 101 105)
6 193: (135 2 58 177 53 27)
3 21: (21 5 13)
9 51: (1 37 38 47 5 33 7 11 20)
2 92: (16 50)
5 119: (99 102 17 105 25)
6 65: (1 54 43 30 56 47)
3 128: (98 54 114)
3 94: (36 33 57)
4 162: (149 108 111 30)
5 138: (125 11 3 51 110)
1 159: (33)
2 116: (105 57)
6 166: (107 81 160 131 78 26)
3 155: (127 17 115)
6 185: (39 133 3 177 130 32)
5 45: (29 10 20 4 44)
7 147: (38 30 29 57 91 90 82)
6 69: (33 59 57 37 52 26)
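As a final sanity check (my own addition), we can estimate the inclusion probabilities empirically: draw many samples from a fixed stream and count how often each element shows up. With \(k = 3\) and \(n = 10\), each frequency should be close to \(k/n = 0.3\), though the exact numbers vary from run to run:

(let* ((k 3)
       (n 10)
       (trials 100000)
       (counts (make-array (1+ n) :initial-element 0))
       (s (loop for i from 1 to n collect i)))
  (loop repeat trials
        do (dolist (x (sample k s))
             (incf (aref counts x))))
  ;; Relative frequency of each element 1..n over all trials.
  (loop for i from 1 to n
        collect (float (/ (aref counts i) trials))))

A systematic deviation, for instance some elements showing up noticeably more often than others, would point to an off-by-one in the random draw.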