If \((X,d)\) is a metric space and if \(f\colon X\to X\) is a self map there is a quantity of interest: I’d like to see the distribution of the numbers defined for every \(x\neq y\)
\[\frac{d(f(x),f(y))}{d(x,y)}\]
In today’s note, the metric space is the space of bitstrings of a fixed length with the Hamming distance as metric and \(f\colon X\to X\) is going to be MD5 hash digest and SHA1 hash digest functions.
I am going to need ironclad which is a native common lisp hashing and crypto library.
(require :ironclad)
NIL
which has the MD5 and SHA1 hashing functions:
(let ((hasher (ironclad:make-digest :md5)))
(defun md5-digest (x)
(coerce (ironclad:digest-sequence hasher x) 'list)))
MD5-DIGEST
(let ((hasher (ironclad:make-digest :sha1)))
(defun sha1-digest (x)
(coerce (ironclad:digest-sequence hasher x) 'list)))
SHA1-DIGEST
The function md5-digest
will return the digest as a list
of 16 bytes (unsigned 8-bit integers). The space of bit strings of
length 128 is large. So, I will use a Monte-Carlo method which requires
for us to generate random byte strings of length 16:
(defun random-vector (m)
(coerce
(loop repeat m collect (random 256))
'(vector (unsigned-byte 8))))
RANDOM-VECTOR
Now the function that calculates the ratio we are interested in
is
(defun hamming-distance (x y)
(labels ((calc (x y &optional (acc 0))
(cond
((< x y) (calc y x 0))
((zerop x) acc)
(t (calc (ash x -1)
(ash y -1)
(if (equal (oddp x) (oddp y))
acc
(1+ acc)))))))
(reduce #'+ (map 'list (lambda (i j) (calc i j)) x y))))
HAMMING-DISTANCE
Now, let me test this on a sample of size 12000 on MD5:
(with-open-file (out "hamming-data.csv" :direction :output
:if-exists :supersede
:if-does-not-exist :create)
(loop repeat 12000 do
(let ((x (random-vector 16))
(y (random-vector 16)))
(format out "~4,2f~%" (/ (hamming-distance (md5-digest x) (md5-digest y))
(hamming-distance x y)
1.0)))))
NIL
Here is a histogram of the data:
Now, let me test this on SHA1. This time, the random vectors are of length 20 because SHA1 digest produces 20 bytes:
(with-open-file (out "hamming-data2.csv" :direction :output
:if-exists :supersede
:if-does-not-exist :create)
(loop repeat 12000 do
(let ((x (random-vector 20))
(y (random-vector 20)))
(format out "~4,2f~%" (/ (hamming-distance (sha1-digest x) (sha1-digest y))
(hamming-distance x y)
1.0)))))
NIL
Here is a histogram of the data I produced above:
I am going to use Kolmogorov-Smirnov
test to see if these distributions are similar. The result below
indicates that even though the statistic is small, the p-value tells us
that the distributions we see here are actually very different.
Two-sample Kolmogorov-Smirnov test
data: X and Y
D = 0.0365, p-value = 2.28e-07
alternative hypothesis: two-sided