Distinguishing hash functions (part II)

Description of the problem

Yesterday, I asked if it was possible to distinguish hashing functions by looking at their outputs? The theoretical answer is, of course, it should not be possible. But yet, I saw statistically significant differences between the outputs of SHA1 and ADLER32 hash functions. How is that possible?

An explanation

The short answer is: the difference I saw yesterday was an artifact of the experiment.

The long answer is this: Assume we take a uniformly random sequence of integers between 0 and N, and then divide them by $N$ to get a sequence $x_1,\ldots,x_a$. Now, repeat the same experiment with pulling random integers between 0 and M, and then divide them by $M$ to get another sequence $y_1,\ldots,y_b$. Now, we got two uniformly random sequences of rational numbers between 0 and 1. So, these sequences should be similar, right? Wrong!

(defun test (n m)
   (mapcar (lambda (x) (/ x n)) (loop repeat m collect (random n))))
TEST

(let ((xs (test 200 1000))
      (ys (test   5 1000)))
   (ks-test xs ys))
(0.20520520520520522d0 . 0.0d0)

The experiment redesigned

I will use almost all of the code from yesterday, except

(defun random-bag (num hash-fn hash-size cut-size)
   (let ((hsize (truncate hash-size 8))
         (csize (truncate cut-size 8)))
      (labels ((random-vector ()
                  (coerce (loop repeat hsize collect (random 256))
                          '(vector (unsigned-byte 8)))))
         (loop repeat num collect
               (subseq (funcall hash-fn (random-vector)) 0 csize)))))
RANDOM-BAG

and the main experiment code.

(defun experiment (xs num)
   (let ((n (length xs)))
      (loop repeat num collect
         (hamming-distance (elt xs (random n)) (elt xs (random n))))))
EXPERIMENT

This time around, to make sure that we get the same length outputs I am truncating hash-digests to 32 bits since ADLER32 produces the shortest digests of 32 bits.

From a given bag xs we pick two random vectors, calculate their Hamming distance. We repeat this for a given number num times. This time around I am not going to normalize the distances to 1. KS-test still works.

(let ((md5-bag (random-bag 25000 #'md5-digest 128 32))
      (sha1-bag (random-bag 32000 #'sha1-digest 160 32))
      (adler32-bag (random-bag 40000 #'adler32-digest 32 32)))
   (list (ks-test (experiment sha1-bag 800)
                  (experiment sha1-bag 600))
         (ks-test (experiment md5-bag 700)
                  (experiment sha1-bag 600))
         (ks-test (experiment md5-bag 900)
                  (experiment adler32-bag 500))
         (ks-test (experiment sha1-bag 800)
                  (experiment adler32-bag 500))))
((0.08524637432851168d0 . 3.0952126842698213d-4)
 (0.09077599528064184d0 . 9.573944195506723d-5)
 (0.09263019921935078d0 . 3.545942388081258d-4)
 (0.1045846386138986d0 . 3.314053202163603d-5))

The reference number we should look at is the first one where I compare SHA1 hash outputs with themselves. Now, differences disappeared as they should.