The main idea of this post comes from a post on Daniel Lemire’s blog.
Assume that we have a stream of distinct objects of some unknown length, and that we are supposed to take a sample of \(k\) objects from this stream. We can use the Reservoir Sampling algorithm to perform this task efficiently.
Function Sample
Input:  k, the number of samples
        S, the stream of objects of unknown length
Output: A list of objects from S of length k
Initialize: Let res be the first k terms of S
Begin
    i <- k+1
    While S is not empty
        Read an element x from S
        Let j be a random integer between 1 and i
        If j <= k then
            res[j] <- x
        End
        i <- i+1
    End
    Return res
End
It is easy to see that the algorithm finishes in one pass, and that the work done in each iteration is constant, independent of the length of the stream. So, the algorithm runs in \(\mathcal{O}(n)\) time where \(n\) is the length of the stream, and it uses only \(\mathcal{O}(k)\) extra memory for the reservoir.
I lifted the argument for the correctness of the algorithm I present here (with some pedantic changes) from here.
I will use induction to show correctness; the invariant is that after \(m\) elements have been read, each of them is in res with probability \(k/m\). The base case is when \(n\), the length of the stream, is equal to \(k\), the number of elements we are supposed to pick. The algorithm clearly states that we pick every element from the stream, so the probability of picking each \(x\in S\) is \(k/k = 1\). Assume, as our induction hypothesis, that at the \(m\)-th step the probability of each \(x\) from the stream so far being in res is \(k/m\). Let \(y\) be the element of \(S\) provided to us at the \((m+1)\)-st step. Our algorithm dictates that the probability of picking \(y\) is \(k/(m+1)\), and, given that \(y\) is picked, the probability that it replaces the element at the \(i\)-th position in res is \(1/k\). Thus the probability that the element at the \(i\)-th position in res is replaced with \(y\) is \(1/(m+1)\), and the probability that each of the remaining elements in res survives to the next round is \(m/(m+1)\). The probability that we picked such an element during rounds 1 to \(m\) was \(k/m\). Thus, at the next round the probabilities become \((k/m)(m/(m+1)) = k/(m+1)\).
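To make the inductive step concrete, take \(k = 2\) and \(m = 3\): the fourth element is picked with probability \(2/4\), and it lands in a particular slot of res with probability \((2/4)(1/2) = 1/4\); an element already in res therefore survives with probability \(3/4\), so its overall probability of being in the sample goes from \(2/3\) to \((2/3)(3/4) = 2/4\), exactly \(k/(m+1)\).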
As one can see, this is a streaming algorithm: if the stream is cut off at any point, the sample we hold at that moment is a uniform sample of the elements seen so far. Each iteration preserves the uniformity of the sample even though the population it is drawn from grows by one at every turn.
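For instance, here is a minimal sketch of that idea (my own addition, not from the post I cite) that samples \(k\) lines from a character stream; whenever the stream runs out, or we simply stop reading, the reservoir holds a uniform sample of the lines seen so far:

(defun sample-lines (k in)
  ;; A sketch: reservoir-sample k lines from the character stream in.
  ;; If the stream has fewer than k lines, the reservoir is padded with NIL.
  (let ((res (make-array k :initial-element nil))
        (i 0))
    (loop for line = (read-line in nil)   ; nil => return NIL at end of stream
          while line
          do (if (< i k)
                 (setf (aref res i) line)        ; fill the reservoir first
                 (let ((j (random (1+ i))))      ; j uniform over 0..i
                   (when (< j k)
                     (setf (aref res j) line)))) ; keep line with probability k/(i+1)
             (incf i))
    (coerce res 'list)))

For example, (with-open-file (in "lines.txt") (sample-lines 5 in)) picks 5 lines uniformly in one pass, where lines.txt is a stand-in name for any text file.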
The implementation for plain lists is straightforward:
(defun sample (k s)
  (let ((res (make-array k))
        (i 0))
    (dolist (x s)
      (if (< i k)
          (setf (aref res i) x)          ; fill the reservoir with the first k elements
          (let ((j (random (1+ i))))     ; x is the (i+1)-st element; j uniform over 0..i
            (when (< j k)
              (setf (aref res j) x))))   ; keep x with probability k/(i+1)
      (incf i))
    (coerce res 'list)))
Let me test it:
(loop repeat 20 do
  (let* ((x (1+ (random 9)))                      ; k, between 1 and 9
         (y (+ x (random (- 200 x))))             ; stream length, between k and 199
         (z (loop for i from 1 to y collect i)))  ; the stream: 1, 2, ..., y
    (format t "~% ~1D ~3D: ~A" x y (sample x z))))
7 128: (37 63 108 116 109 126 42)
4 155: (67 82 101 105)
6 193: (135 2 58 177 53 27)
3 21: (21 5 13)
9 51: (1 37 38 47 5 33 7 11 20)
2 92: (16 50)
5 119: (99 102 17 105 25)
6 65: (1 54 43 30 56 47)
3 128: (98 54 114)
3 94: (36 33 57)
4 162: (149 108 111 30)
5 138: (125 11 3 51 110)
1 159: (33)
2 116: (105 57)
6 166: (107 81 160 131 78 26)
3 155: (127 17 115)
6 185: (39 133 3 177 130 32)
5 45: (29 10 20 4 44)
7 147: (38 30 29 57 91 90 82)
6 69: (33 59 57 37 52 26)
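As a final sanity check (my own addition), we can estimate the inclusion probabilities empirically: draw many samples from a fixed stream and count how often each element shows up. With \(k = 3\) and \(n = 10\), each frequency should be close to \(k/n = 0.3\), though the exact numbers vary from run to run:

(let* ((k 3)
       (n 10)
       (trials 100000)
       (counts (make-array (1+ n) :initial-element 0))
       (s (loop for i from 1 to n collect i)))
  (loop repeat trials
        do (dolist (x (sample k s))
             (incf (aref counts x))))
  ;; Relative frequency of each element 1..n over all trials.
  (loop for i from 1 to n
        collect (float (/ (aref counts i) trials))))

A systematic deviation, for instance some elements showing up noticeably more often than others, would point to an off-by-one in the random draw.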