Yesterday, I asked if it was possible to distinguish hashing functions by looking at their outputs? The theoretical answer is, of course, it should not be possible. But yet, I saw statistically significant differences between the outputs of SHA1 and ADLER32 hash functions. How is that possible?
The short answer is: the difference I saw yesterday was an artifact of the experiment.
The long answer is this: Assume we take a uniformly random sequence of integers between 0 and N, and then divide them by \(N\) to get a sequence \(x_1,\ldots,x_a\). Now, repeat the same experiment with pulling random integers between 0 and M, and then divide them by \(M\) to get another sequence \(y_1,\ldots,y_b\). Now, we got two uniformly random sequences of rational numbers between 0 and 1. So, these sequences should be similar, right? Wrong!
(defun test (n m)
(mapcar (lambda (x) (/ x n)) (loop repeat m collect (random n))))
TEST
(let ((xs (test 200 1000))
(ys (test 5 1000)))
(ks-test xs ys))
(0.20520520520520522d0 . 0.0d0)
I will use almost all of the code from yesterday, except
(defun random-bag (num hash-fn hash-size cut-size)
(let ((hsize (truncate hash-size 8))
(csize (truncate cut-size 8)))
(labels ((random-vector ()
(coerce (loop repeat hsize collect (random 256))
'(vector (unsigned-byte 8)))))
(loop repeat num collect
(subseq (funcall hash-fn (random-vector)) 0 csize)))))
RANDOM-BAG
and the main experiment code.
(defun experiment (xs num)
(let ((n (length xs)))
(loop repeat num collect
(hamming-distance (elt xs (random n)) (elt xs (random n))))))
EXPERIMENT
This time around, to make sure that we get the same length outputs I am truncating hash-digests to 32 bits since ADLER32 produces the shortest digests of 32 bits.
From a given bag xs
we pick two random vectors,
calculate their Hamming distance. We repeat this for a given number
num
times. This time around I am not going to normalize the
distances to 1. KS-test still works.
(let ((md5-bag (random-bag 25000 #'md5-digest 128 32))
(sha1-bag (random-bag 32000 #'sha1-digest 160 32))
(adler32-bag (random-bag 40000 #'adler32-digest 32 32)))
(list (ks-test (experiment sha1-bag 800)
(experiment sha1-bag 600))
(ks-test (experiment md5-bag 700)
(experiment sha1-bag 600))
(ks-test (experiment md5-bag 900)
(experiment adler32-bag 500))
(ks-test (experiment sha1-bag 800)
(experiment adler32-bag 500))))
((0.08524637432851168d0 . 3.0952126842698213d-4)
(0.09077599528064184d0 . 9.573944195506723d-5)
(0.09263019921935078d0 . 3.545942388081258d-4)
(0.1045846386138986d0 . 3.314053202163603d-5))
The reference number we should look at is the first one where I compare SHA1 hash outputs with themselves. Now, differences disappeared as they should.