Would it be possible to distinguish hashing functions by looking only at their outputs assuming that these hashes produce outputs of the same length?
I will approach the problem as follows: Assuming I have two bags of hashes, is it possible to decide if these hashes come from two different hash functions?
For each bag, I will calculate Hamming distances of random pairs divided by the length of the hash. This number is the percentage of different bits between two hashes. Then I will have two new bags of numbers between 0 and 1. Problem reduces to check if these bags differ statistically.
I will re-use two of my earlier posts: The first post I am going to re-use was on implementing Kolmogorov-Smirnov test to test if two bags of real numbers differ in their distribution. The second post I am going to re-use was about calculating Hamming derivatives of two hash functions.
Let us start with our package description, and libraries I am going to need:
(mapc #'require '(:cffi :ironclad))
(CFFI IRONCLAD)
Next are the hash functions:
(let ((hasher (ironclad:make-digest :md5)))
(defun md5-digest (x)
(coerce (ironclad:digest-sequence hasher x) 'list)))
MD5-DIGEST
(let ((hasher (ironclad:make-digest :sha1)))
(defun sha1-digest (x)
(coerce (ironclad:digest-sequence hasher x) 'list)))
SHA1-DIGEST
(let ((hasher (ironclad:make-digest :adler32)))
(defun adler32-digest (x)
(coerce (ironclad:digest-sequence hasher x) 'list)))
ADLER32-DIGEST
I am using MD5, SHA-1 and ADLER-32.
I am going to need an implementation of the Hamming distance function
(defun hamming-distance (x y)
(labels ((calc (x y &optional (acc 0))
(cond
((< x y) (calc y x 0))
((zerop x) acc)
(t (calc (ash x -1)
(ash y -1)
(if (equal (oddp x) (oddp y))
acc
(1+ acc)))))))
(reduce #'+ (map 'list (lambda (i j) (calc i j)) x y))))
HAMMING-DISTANCE
Next, I am going to need an implementation for the KS-test. I will lift the code from my earlier post on KS-test as is:
(cffi:load-foreign-library
"/home/kaygun/src/lisp/kolmogorov-smirnov/ks.so")
#<FOREIGN-LIBRARY KS.SO-705 "ks.so">
(cffi:defcfun ("KScdf" ks) :double (n :int) (x :double))
KS
(defun cdf (xs)
(let ((n (1- (length xs))))
(lambda (x)
(labels ((fn (a b)
(let ((m (truncate (/ (+ b a) 2))))
(cond
((or (<= x (elt xs a))
(<= (- b a) 1))
(/ a n 1.0d0))
((>= x (elt xs b))
(/ b n 1.0d0))
((> (elt xs m) x) (fn a m))
(t (fn m b))))))
(fn 0 n)))))
CDF
(defun ks-test (xs ys)
(multiple-value-bind (as bs)
(if (>= (length xs) (length ys))
(values xs ys)
(values ys xs))
(let* ((n (1- (length bs)))
(fn (cdf (sort as #'<)))
(cs (sort bs #'<))
(d (loop for i from 0 below (length cs) maximize
(abs (- (funcall fn (elt cs i))
(/ i n 1d0))))))
(cons d (- 1.0d0 (ks n d))))))
KS-TEST
The first part requires us to generate a random collection of hash digests:
(defun random-bag (num hash-fn hash-size)
(let ((size (truncate (/ hash-size 8))))
(labels ((random-vector ()
(coerce (loop repeat size collect (random 256))
'(vector (unsigned-byte 8)))))
(loop repeat num collect
(funcall hash-fn (random-vector))))))
RANDOM-BAG
The main part of our code is as follows:
(defun experiment (xs num size)
(let ((n (length xs)))
(loop repeat num collect
(/ (hamming-distance (elt xs (random n)) (elt xs (random n)))
size 1.0))))
EXPERIMENT
From a given bag xs
we pick two random vector, calculate
their hamming distance and then convert the distance to the percentage
of different bits by dividing the distance to size
. We
repeat this for a given number num
times.
OK. We are ready:
(let ((md5-bag (random-bag 40000 #'md5-digest 128))
(sha1-bag (random-bag 40000 #'sha1-digest 160))
(adler32-bag (random-bag 40000 #'adler32-digest 32)))
(list (ks-test (experiment md5-bag 1000 128)
(experiment sha1-bag 1000 160))
(ks-test (experiment md5-bag 1000 128)
(experiment adler32-bag 1000 32))
(ks-test (experiment sha1-bag 1000 160)
(experiment adler32-bag 1000 32))))
((0.08708708708708712d0 . 4.842864418375115d-7)
(0.23023023023023026d0 . 0.0d0) (0.25225225225225223d0 . 0.0d0))
There seems to be statistically significant differences between hash
function outputs. The KS-distance between MD5 and SHA1 is very small.
However, the KS-distances between ADLER-32 outputs from the other two
are significantly larger. So, the answer to my original question is: we
may distinguish hash functions by looking at their outputs only. I’ll
come back to this soon.