Distinguishing hash functions

Description of the problem

Would it be possible to distinguish hashing functions by looking only at their outputs assuming that these hashes produce outputs of the same length?

I will approach the problem as follows: Assuming I have two bags of hashes, is it possible to decide if these hashes come from two different hash functions?

For each bag, I will calculate Hamming distances of random pairs divided by the length of the hash. This number is the percentage of different bits between two hashes. Then I will have two new bags of numbers between 0 and 1. Problem reduces to check if these bags differ statistically.

Implementation

I will re-use two of my earlier posts: The first post I am going to re-use was on implementing Kolmogorov-Smirnov test to test if two bags of real numbers differ in their distribution. The second post I am going to re-use was about calculating Hamming derivatives of two hash functions.

Let us start with our package description, and libraries I am going to need:

(mapc #'require '(:cffi :ironclad))
(CFFI IRONCLAD)

Next are the hash functions:

(let ((hasher (ironclad:make-digest :md5)))
   (defun md5-digest (x)
      (coerce (ironclad:digest-sequence hasher x) 'list)))
MD5-DIGEST

(let ((hasher (ironclad:make-digest :sha1)))
   (defun sha1-digest (x)
      (coerce (ironclad:digest-sequence hasher x) 'list)))
SHA1-DIGEST

(let ((hasher (ironclad:make-digest :adler32)))
   (defun adler32-digest (x)
      (coerce (ironclad:digest-sequence hasher x) 'list)))
ADLER32-DIGEST

I am using MD5, SHA-1 and ADLER-32.

I am going to need an implementation of the Hamming distance function

(defun hamming-distance (x y)
   (labels ((calc (x y &optional (acc 0))
               (cond
                 ((< x y) (calc y x 0))
                 ((zerop x) acc)
                 (t (calc (ash x -1)
                          (ash y -1)
                          (if (equal (oddp x) (oddp y)) 
                              acc
                              (1+ acc)))))))
     (reduce #'+ (map 'list (lambda (i j) (calc i j)) x y))))
HAMMING-DISTANCE

Next, I am going to need an implementation for the KS-test. I will lift the code from my earlier post on KS-test as is:

(cffi:load-foreign-library 
   "/home/kaygun/src/lisp/kolmogorov-smirnov/ks.so")
#<FOREIGN-LIBRARY KS.SO-705 "ks.so">

(cffi:defcfun ("KScdf" ks) :double (n :int) (x :double))
KS

(defun cdf (xs)
  (let ((n (1- (length xs))))
    (lambda (x)
      (labels ((fn (a b)
                 (let ((m (truncate (/ (+ b a) 2))))
                   (cond
                     ((or (<= x (elt xs a))
                          (<= (- b a) 1))
                      (/ a n 1.0d0))
                     ((>= x (elt xs b))
                      (/ b n 1.0d0))
                     ((> (elt xs m) x) (fn a m))
                     (t (fn m b))))))
        (fn 0 n)))))
CDF

(defun ks-test (xs ys)
  (multiple-value-bind (as bs)
      (if (>= (length xs) (length ys))
          (values xs ys)
          (values ys xs))
    (let* ((n (1- (length bs)))
           (fn (cdf (sort as #'<)))
           (cs (sort bs #'<))
           (d (loop for i from 0 below (length cs) maximize
                   (abs (- (funcall fn (elt cs i))
                           (/ i n 1d0))))))
      (cons d (- 1.0d0 (ks n d))))))
KS-TEST

The first part requires us to generate a random collection of hash digests:

(defun random-bag (num hash-fn hash-size)
   (let ((size (truncate (/ hash-size 8))))
      (labels ((random-vector ()
                  (coerce (loop repeat size collect (random 256))
                          '(vector (unsigned-byte 8)))))
         (loop repeat num collect
               (funcall hash-fn (random-vector))))))
RANDOM-BAG

The main part of our code is as follows:

(defun experiment (xs num size)
   (let ((n (length xs)))
      (loop repeat num collect
         (/ (hamming-distance (elt xs (random n)) (elt xs (random n)))
      size 1.0))))
EXPERIMENT

From a given bag xs we pick two random vector, calculate their hamming distance and then convert the distance to the percentage of different bits by dividing the distance to size. We repeat this for a given number num times.

OK. We are ready:

(let ((md5-bag (random-bag 40000 #'md5-digest 128))
      (sha1-bag (random-bag 40000 #'sha1-digest 160))
      (adler32-bag (random-bag 40000 #'adler32-digest 32)))
   (list (ks-test (experiment md5-bag 1000 128)
                  (experiment sha1-bag 1000 160))
         (ks-test (experiment md5-bag 1000 128)
                  (experiment adler32-bag 1000 32))
         (ks-test (experiment sha1-bag 1000 160)
                  (experiment adler32-bag 1000 32))))
((0.08708708708708712d0 . 4.842864418375115d-7)
 (0.23023023023023026d0 . 0.0d0) (0.25225225225225223d0 . 0.0d0))

Analysis

There seems to be statistically significant differences between hash function outputs. The KS-distance between MD5 and SHA1 is very small. However, the KS-distances between ADLER-32 outputs from the other two are significantly larger. So, the answer to my original question is: we may distinguish hash functions by looking at their outputs only. I’ll come back to this soon.