The Kitchen Sink and Other Oddities

Atabey Kaygun

Entropy of truncated MD5 hashing

Description of the problem

MD5 is a commonly used hashing algorithm to store passwords whose input can vary in length (from a single character to an entire text of Iliad if you’d like) but whose output is uniformly 128 bits. Today I’d like to test the following: if we were to select certain bytes from the output and use these bytes only as our hash, would we see something other than a truly uniform random output? Are there weaknesses in MD5, not when we use the whole 128 bits but certain parts of it?

Entropy

Entropy is a good numerical measure for randomness. So, I will generate a good number of random strings then apply the hash function. Then I will use a fixed portion of the output and measure the entropy. If the entropy is less than the entropy for a uniformly random collection of bytes then MD5 might have weakness in those bytes.

Implementation

I am going to need an MD5 implementation for lisp.

(require :md5)

The following function is for generating random strings of a specific length. I lifted the code from code codex with minimal modification.

(defun random-string (n)
   (let* ((chars "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789")
          (m (length chars)))
      (coerce (loop repeat n collect (aref chars (random m))) 'string)))

RANDOM-STRING

The following function generates a collection of random strings and then pushes through MD5 hashing function with the specified projection. At the end we calculate all the collisions and calculate the entropy of these collisions.

(defun entropy (m n pick)
   (let ((coll (make-hash-table :test 'equal))
         (res 0.0d0))
      (loop repeat m do
         (let* ((str (md5:md5sum-string (random-string n)))
                (pro (mapcar (lambda (i) (aref str i)) pick))
                (val (gethash pro coll 0)))
            (setf (gethash pro coll) (1+ val))))
      (maphash 
         (lambda (x y) 
             (incf res (* y (log y 2d0)))) 
         coll)
      (- (log m 2d0) (/ res m))))

ENTROPY

I am going to test all of the projections of MD5 onto 2 bytes.

Byte 1 Byte 2 Entropy
0 1 15.452
0 2 15.453
0 3 15.452
0 4 15.455
0 5 15.453
0 6 15.453
0 7 15.451
0 8 15.451
0 9 15.452
0 10 15.453
0 11 15.455
0 12 15.456
0 13 15.449
0 14 15.449
0 15 15.453
1 2 15.447
1 3 15.454
1 4 15.454
1 5 15.457
1 6 15.455
1 7 15.456
1 8 15.456
1 9 15.451
1 10 15.450
1 11 15.450
1 12 15.451
1 13 15.453
1 14 15.451
1 15 15.452
2 3 15.449
2 4 15.454
2 5 15.451
2 6 15.455
2 7 15.455
2 8 15.449
2 9 15.453
2 10 15.452
2 11 15.453
2 12 15.451
2 13 15.453
2 14 15.456
2 15 15.454
3 4 15.450
3 5 15.451
3 6 15.452
3 7 15.453
3 8 15.453
3 9 15.451
3 10 15.455
3 11 15.449
3 12 15.451
3 13 15.456
3 14 15.452
3 15 15.452
4 5 15.453
4 6 15.454
4 7 15.454
4 8 15.453
4 9 15.451
4 10 15.453
4 11 15.452
4 12 15.449
4 13 15.449
4 14 15.456
4 15 15.454
5 6 15.456
5 7 15.455
5 8 15.448
5 9 15.458
5 10 15.451
5 11 15.450
5 12 15.452
5 13 15.458
5 14 15.449
5 15 15.455
6 7 15.453
6 8 15.451
6 9 15.454
6 10 15.451
6 11 15.453
6 12 15.450
6 13 15.452
6 14 15.449
6 15 15.452
7 8 15.453
7 9 15.448
7 10 15.452
7 11 15.448
7 12 15.452
7 13 15.450
7 14 15.449
7 15 15.452
8 9 15.456
8 10 15.454
8 11 15.455
8 12 15.454
8 13 15.454
8 14 15.450
8 15 15.457
9 10 15.451
9 11 15.446
9 12 15.453
9 13 15.451
9 14 15.449
9 15 15.455
10 11 15.448
10 12 15.454
10 13 15.454
10 14 15.451
10 15 15.448
11 12 15.454
11 13 15.452
11 14 15.452
11 15 15.452
12 13 15.451
12 14 15.450
12 15 15.451
13 14 15.451
13 15 15.452
14 15 15.455

An analysis

The theoretical limit for the entropy for truly random collection of 2 bytes is \[ -\sum_{i=1}^{2^{16}} \frac{1}{2^{16}}\log_2(2^{-16}) = 16 \] and the values are uniformly close to the theoretical value. I didn’t have time to repeat the experiment with 3 or more bytes. The code above took about a minute on my old workstation at work.

In short, if you’d like to cut short of an MD5 hash you can use any portion of it you’d like.