MD5 is a commonly used hashing algorithm to store passwords whose input can vary in length (from a single character to an entire text of Iliad if you’d like) but whose output is uniformly 128 bits. Today I’d like to test the following: if we were to select certain bytes from the output and use these bytes only as our hash, would we see something other than a truly uniform random output? Are there weaknesses in MD5, not when we use the whole 128 bits but certain parts of it?
Entropy is a good numerical measure for randomness. So, I will generate a good number of random strings then apply the hash function. Then I will use a fixed portion of the output and measure the entropy. If the entropy is less than the entropy for a uniformly random collection of bytes then MD5 might have weakness in those bytes.
I am going to need an MD5 implementation for lisp.
(require :md5)
The following function is for generating random strings of a specific length. I lifted the code from code codex with minimal modification.
(defun random-string (n)
(let* ((chars "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789")
(m (length chars)))
(coerce (loop repeat n collect (aref chars (random m))) 'string)))
RANDOM-STRING
The following function generates a collection of random strings and then pushes through MD5 hashing function with the specified projection. At the end we calculate all the collisions and calculate the entropy of these collisions.
(defun entropy (m n pick)
(let ((coll (make-hash-table :test 'equal))
(res 0.0d0))
(loop repeat m do
(let* ((str (md5:md5sum-string (random-string n)))
(pro (mapcar (lambda (i) (aref str i)) pick))
(val (gethash pro coll 0)))
(setf (gethash pro coll) (1+ val))))
(maphash
(lambda (x y)
(incf res (* y (log y 2d0))))
coll)
(- (log m 2d0) (/ res m))))
ENTROPY
I am going to test all of the projections of MD5 onto 2 bytes.
Byte 1 | Byte 2 | Entropy |
---|---|---|
0 | 1 | 15.452 |
0 | 2 | 15.453 |
0 | 3 | 15.452 |
0 | 4 | 15.455 |
0 | 5 | 15.453 |
0 | 6 | 15.453 |
0 | 7 | 15.451 |
0 | 8 | 15.451 |
0 | 9 | 15.452 |
0 | 10 | 15.453 |
0 | 11 | 15.455 |
0 | 12 | 15.456 |
0 | 13 | 15.449 |
0 | 14 | 15.449 |
0 | 15 | 15.453 |
1 | 2 | 15.447 |
1 | 3 | 15.454 |
1 | 4 | 15.454 |
1 | 5 | 15.457 |
1 | 6 | 15.455 |
1 | 7 | 15.456 |
1 | 8 | 15.456 |
1 | 9 | 15.451 |
1 | 10 | 15.450 |
1 | 11 | 15.450 |
1 | 12 | 15.451 |
1 | 13 | 15.453 |
1 | 14 | 15.451 |
1 | 15 | 15.452 |
2 | 3 | 15.449 |
2 | 4 | 15.454 |
2 | 5 | 15.451 |
2 | 6 | 15.455 |
2 | 7 | 15.455 |
2 | 8 | 15.449 |
2 | 9 | 15.453 |
2 | 10 | 15.452 |
2 | 11 | 15.453 |
2 | 12 | 15.451 |
2 | 13 | 15.453 |
2 | 14 | 15.456 |
2 | 15 | 15.454 |
3 | 4 | 15.450 |
3 | 5 | 15.451 |
3 | 6 | 15.452 |
3 | 7 | 15.453 |
3 | 8 | 15.453 |
3 | 9 | 15.451 |
3 | 10 | 15.455 |
3 | 11 | 15.449 |
3 | 12 | 15.451 |
3 | 13 | 15.456 |
3 | 14 | 15.452 |
3 | 15 | 15.452 |
4 | 5 | 15.453 |
4 | 6 | 15.454 |
4 | 7 | 15.454 |
4 | 8 | 15.453 |
4 | 9 | 15.451 |
4 | 10 | 15.453 |
4 | 11 | 15.452 |
4 | 12 | 15.449 |
4 | 13 | 15.449 |
4 | 14 | 15.456 |
4 | 15 | 15.454 |
5 | 6 | 15.456 |
5 | 7 | 15.455 |
5 | 8 | 15.448 |
5 | 9 | 15.458 |
5 | 10 | 15.451 |
5 | 11 | 15.450 |
5 | 12 | 15.452 |
5 | 13 | 15.458 |
5 | 14 | 15.449 |
5 | 15 | 15.455 |
6 | 7 | 15.453 |
6 | 8 | 15.451 |
6 | 9 | 15.454 |
6 | 10 | 15.451 |
6 | 11 | 15.453 |
6 | 12 | 15.450 |
6 | 13 | 15.452 |
6 | 14 | 15.449 |
6 | 15 | 15.452 |
7 | 8 | 15.453 |
7 | 9 | 15.448 |
7 | 10 | 15.452 |
7 | 11 | 15.448 |
7 | 12 | 15.452 |
7 | 13 | 15.450 |
7 | 14 | 15.449 |
7 | 15 | 15.452 |
8 | 9 | 15.456 |
8 | 10 | 15.454 |
8 | 11 | 15.455 |
8 | 12 | 15.454 |
8 | 13 | 15.454 |
8 | 14 | 15.450 |
8 | 15 | 15.457 |
9 | 10 | 15.451 |
9 | 11 | 15.446 |
9 | 12 | 15.453 |
9 | 13 | 15.451 |
9 | 14 | 15.449 |
9 | 15 | 15.455 |
10 | 11 | 15.448 |
10 | 12 | 15.454 |
10 | 13 | 15.454 |
10 | 14 | 15.451 |
10 | 15 | 15.448 |
11 | 12 | 15.454 |
11 | 13 | 15.452 |
11 | 14 | 15.452 |
11 | 15 | 15.452 |
12 | 13 | 15.451 |
12 | 14 | 15.450 |
12 | 15 | 15.451 |
13 | 14 | 15.451 |
13 | 15 | 15.452 |
14 | 15 | 15.455 |
The theoretical limit for the entropy for truly random collection of 2 bytes is \[ -\sum_{i=1}^{2^{16}} \frac{1}{2^{16}}\log_2(2^{-16}) = 16 \] and the values are uniformly close to the theoretical value. I didn’t have time to repeat the experiment with 3 or more bytes. The code above took about a minute on my old workstation at work.
In short, if you’d like to cut short of an MD5 hash you can use any portion of it you’d like.