Recently, I saw a nice algorithm by Vincent Granville that simulates samples obeying Zipf’s Law using a nice physical process. But I had to play with the parameters of the simulation a bit.
I will copy the description of the algorithm (with small changes) from Vincent:
Each particle is assigned a unique bin ID between 1 and \(N\). Each particle represents a cluster with one element (the particle),and the bin ID is its cluster ID.
Iteration: repeat \(r\) times:
The implementation was pretty straightforward.
Start with a comparison function that compares the sizes of two clusters. The following function is what I thought to be appropriate, but of course, you may change it to suit your needs.
(defun comp (a b &optional (eps 1d-1))
(< (abs (- 5d-1 (/ a (+ a b)))) eps))
Now, the main function. The function takes the following parameters:
num
: The size of the population.rep
: The number of times the simulation runs.u
: The probability that two similar clusters will
merge.v
: The probability that two non-similar clusters will
merge. (defun zipf-simulate (num rep u v)
(let ((bag (loop for i from 0 below num collect (list i))))
(loop while (and (cdr bag) (> rep 0)) do
(let* ((m (length bag))
(i (random m))
(j (do ((k (random m) (random m))) ((not (= i k)) k)))
(a (nth i bag))
(b (nth j bag))
(thrsh (if (comp (length a) (length b))
u
v)))
(if (< (random 1.0d0) thrsh)
(progn
(nconc (nth i bag) b)
(setf bag (delete b bag :test #‘equal))))
(decf rep)))
(sort (mapcar #'length bag) #’>)))
I ran the simulation with the parameters num=2000
,
rep=8000
, u=0.8
and v=0.001
and
the results were:
1048
513
256
128
32
16
4
2
1
The numbers decrease exponentially which is expected with such a small \(v\) parameter. Think of the extreme case of\(u=1.0\) (probability that two similar clusters will merge) and \(v=0.0\) (probability that two non-similar clusters will merge.) Then from an initial cluster of size \(N\) one gets sizes distributed exactly as the binary expansion of \(N\) which is quite interesting.
When I ran the simulation with the parameters (num=4095
,
rep=7600
, u=0.8
and v=0.3
) chosen
along the lines Vincent
suggested, I got something which is closer to what he expected.
And here are another implementation in (non-idiomatic) clojure in case common lisp is too archaic for you:
(defn zipf [n rep u v]
(let [rnd (java.util.Random.)
mycom (fn [a b]
(if (< (Math/abs (- (double 0.5) (double (/ a (+ a b))))) 0.01)
(double u)
(double v)))]
(loop [bag (map list (range n))
cou rep]
(if (or (zero? cou) (< (count bag) 2))
(sort (map count bag))
(let* [m (count bag)
i (.nextInt rnd m)
j (loop [j (.nextInt rnd m)] (if (not (= i j)) j (recur (.nextInt rnd m))))
a (nth bag i)
b (nth bag j)]
(if (< (.nextFloat rnd) (mycom (count a) (count b)))
(recur (cons (concat a b) (remove #{a b} bag)) (dec cou))
(recur bag (dec cou))))))))