Today, I would like to investigate relative entropy change (before and after) in hashing. In others words, if I have a bag of words which have an initial entropy, after I hash these words what happens to the entropy of the resulting words?
First, I need a random vector of 8-bit vectors:
(defun random-vector (n)
(coerce (loop repeat n collect (random 256)) '(vector (unsigned-byte 8))))
RANDOM-VECTOR
But, I don’t need single individual words but sets of words since I need to calculate entropy:
(defun random-collection (n m)
(loop repeat n collect (random-vector (+ 10 (random m)))))
RANDOM-COLLECTION
Now, the entropy function
(defun entropy (collection)
(let ((result (make-array 256 :initial-element 0))
(entrop 0.0d0)
(N 0))
(dolist (i (loop for x in collection append (coerce x 'list)))
(incf (aref result i)))
(setf N (reduce #'+ result))
(- (log N 2)
(/ (reduce #'+ (map 'list (lambda (x) (if (> x 0) (* x (log x 2)) 0)) result))
N))))
ENTROPY
For hashing I will use the ironclad crypto library. I will define several hash functions where each return a vector of 8-bit bytes.
(require :ironclad)
(SB-ROTATE-BYTE IRONCLAD)
(let ((hasher (ironclad:make-digest :md5)))
(defun md5-digest (x)
(ironclad:digest-sequence hasher x)))
MD5-DIGEST
(let ((hasher (ironclad:make-digest :sha1)))
(defun sha1-digest (x)
(ironclad:digest-sequence hasher x)))
SHA1-DIGEST
(let ((hasher (ironclad:make-digest :sha224)))
(defun sha224-digest (x)
(ironclad:digest-sequence hasher x)))
SHA224-DIGEST
First, I need to write a function which will perform the numerical experiments:
(defun experiment (hashfn rep batch len filename)
(with-open-file (data (concatenate 'string filename ".csv")
:direction :output
:if-exists :overwrite
:if-does-not-exist :create)
(loop repeat rep do
(let* ((N (+ 5 (random batch)))
(M (+ 5 (random len)))
(clear (random-collection N M))
(hashed (mapcar hashfn clear)))
(format data "~4,2f ~4,2f~%" (entropy clear) (entropy hashed))))))
EXPERIMENT
Next, let me see what happens with the MD5 hash:
(experiment #'md5-digest 1000 100 10 "hash2-data-md5")
NIL
Now, with the SHA1 hash:
(experiment #'sha1-digest 1000 100 10 "hash2-data-sha1")
NIL
And, finally with SHA224 hash:
(experiment #'sha224-digest 1000 100 10 "hash2-data-sha224")
NIL
The result is not surprising: hashing isn’t supposed to add or take away any entropy since we expect the hash to be collision-free. Mathematically speaking, we ask the hash function to be one-to-one in practice. Theoretically speaking, this is impossible, hence the adjective in practice. In any case, this was an interesting experiment to see if there is any weaknesses in the SHA1, SHA224 or MD5 hashing.
There is something interesting: There is a slight bent in the SHA224 graph above: it looks like it is sub-linear. This could be an artifact of the small sample size, and definitely worth investigating in the future.