Hamming Distance and Hashing Functions

Description of the Problem

As yesterday, assume \((X,d)\) is a metric space of bitstrings of a fixed length together with the Hamming distance. Today, I am going to look at the distributions of numbers \(d(f(x),x)\) for \(x\in X\).

Implementation

As yesterday, I am going to use ironclad:

(require :ironclad)
NIL

and the same MD5 and SHA1 hashing functions:

(let ((hasher (ironclad:make-digest :md5)))
  (defun md5-digest (x)
    (coerce (ironclad:digest-sequence hasher x) 'list)))
MD5-DIGEST
(let ((hasher (ironclad:make-digest :sha1)))
    (defun sha1-digest (x)
       (coerce (ironclad:digest-sequence hasher x) 'list)))
SHA1-DIGEST

Since the space of bit strings of length 128 is large, I will use a Monte-Carlo method:

(defun random-vector (m)
   (coerce
      (loop repeat m collect (random 256))
      '(vector (unsigned-byte 8))))
RANDOM-VECTOR

And finally, the function that calculates the quantity I am interested in is

(defun hamming-distance (x y)
   (labels ((calc (x y &optional (acc 0))
               (cond
               ((< x y) (calc y x 0))
                 ((zerop x) acc)
                 (t (calc (ash x -1)
                          (ash y -1)
                          (if (equal (oddp x) (oddp y)) 
                             acc
                             (1+ acc)))))))
        (reduce #'+ (map 'list (lambda (i j) (calc i j)) x y))))
HAMMING-DISTANCE

Now, let me test this on a sample of size 40,000 on MD5:

(with-open-file (out "hamming2-data1.csv" :direction :output
                                          :if-exists :supersede
                                          :if-does-not-exist :create)
   (loop repeat 40000 do
       (let ((x (random-vector 16)))
          (format out "~4,9f~%" (/ (hamming-distance  (md5-digest x) x) 1.0)))))
NIL

Here is a histogram of the data:

For the majority of the vectors, it is most likely that about 50% of the bits of the output will be different than the input.

Now, let me test this again on SHA1. Recall that since SHA1 digest produces 20 bytes, I am going to need random vectors of length 20:

(with-open-file (out "hamming2-data2.csv" :direction :output
                                          :if-exists :supersede
                                          :if-does-not-exist :create)
   (loop repeat 40000 do
       (let ((x (random-vector 20)))
          (format out "~4,9f~%" (/ (hamming-distance  (sha1-digest x) x) 1.0)))))
NIL

Here is a histogram of the data:

Again, about 50% of the bits are different between the input and output vectors.

An Analysis

I took the data from these experiments and normalized the values to the interval [0,1]. When I compared the distributions with the Kolmogorov-Smirnov test I see that

Two-sample Kolmogorov-Smirnov test

data:  normalize(as.matrix(X)) and normalize(as.matrix(Y))
D = 0.2524, p-value < 2.2e-16
alternative hypothesis: two-sided

which means that the distributions are different. Very interesting!

Older Posts

[2025-02-23] Counting Matroids

[2025-02-12] Sampling from a Random Variable

[2025-01-29] Markov Numbers

[2024-12-24] Number of isomorphism classes of simple graph (continued)

[2024-12-22] Counting Isomorphism Classes of Graphs

[2024-11-25] Connected Components of Graphs

[2024-11-24] Counting connected components of a graph

[2024-11-18] Counting Isomorphism Classes of m-ary Trees

[2024-11-16] Number of Isomorphism Classes of Ternary Trees

[2024-11-12] Hosoya Index of Balanced Binary Trees

[2024-11-11] Hosoya Index of a Graph

[2024-10-29] The Clique Number of a Simple Graph

[2024-10-28] The Size of Maximally Independent Subsets in a Graph

[2023-11-03] Graph Algorithms in JGraphT with Common Lisp

[2023-10-28] An Implementation of Pandas’ cut and qcut in Lisp

[2023-07-24] A Collatz-like Conjecture for the Projective Line

[2023-03-06] Twin Primes, Cousin Primes, Sexy Primes, and Prime Triplets

[2023-03-02] Set of All Partitions of a Finite Set

[2023-02-14] Non-crossing Partitions and Dyck Words

[2023-02-13] Non-crossing Linear Chords

[2023-02-04] Clojure/Python Interop Examples

[2023-01-14] Graph Algorithms in Clojure with JGraphT

[2022-03-29] 2D-Random Walk

[2022-03-28] Trade Deficit vs Exchange Rate Curve

[2022-03-16] Working with World Bank Data in Python

[2022-03-09] Working with European Central Bank data in python (revisited)

[2022-01-24] A Clique Analysis of Quakers in early modern Britain (1500-1700)

[2021-12-05] Boyer–Moore and Misra-Gries Algorithms in Clojure

[2021-09-12] Tension in Text Plotted

[2021-09-02] Statistical Distributions using Apache Commons Math in Clojure

[2021-08-31] Reduce with Intermediate Results in Common Lisp

[2021-08-21] Multivariate Regression Implemented in Clojure

[2021-05-29] Using Neural Networks to Detect Graph Properties

[2021-04-17] Fast Null-Space Calculation via LU-Decomposition

[2021-02-24] Stoer-Wagner Algorithm in Clojure

[2021-02-19] Calculating Vertex Covers in Clojure

[2021-02-18] Listing All Paths in a Graph

[2021-02-14] Strict Dyck Words and Fibonacci Numbers

[2021-02-14] Kruskal’s Algorithm in Common Lisp

[2021-02-13] Kruskal's Algorithm Implemented in Clojure

[2021-02-10] An integer dynamical system of integers

[2021-02-08] Binary Symmetrization

[2021-01-28] Prüfer Encoding and Decoding of a Tree in Clojure

[2021-01-27] Counting Cycle-Free Paths in a Graph

[2021-01-27] Counting Connected Labeled Graphs

[2020-12-18] Counting Graphs with a Prescribed Degree Sequence

[2020-12-13] Havel–Hakimi Algorithm in Clojure

[2020-12-12] Havel–Hakimi Algorithm in Common-Lisp

[2020-10-23] The Quadratic Casimir Element

[2020-07-04] Collatz Sequence in Binary

[2020-07-02] A Lazy Sequence of Primes in Clojure

[2020-06-10] Yet Another Fizz-Buzz in Common Lisp

[2020-05-12] ECB Data with Clojure and Vega-Lite

[2020-05-06] Processing ECB Data with Common Lisp

[2020-04-17] Next Permutation in the Lexicographical Ordering

[2020-04-13] Turkish Hyphenation in Common Lisp

[2020-04-01] Using JavaPlex with Clojure

[2019-11-05] Constricted Arithmetic Progressions

[2019-11-03] The Number of Arithmetic Progressions of Integers

[2019-05-06] Bron-Kerbosch Algorithm in Clojure

[2019-05-01] An Implementation of Ford-Fulkerson Algorithm in Clojure

[2019-04-22] Document Summarization via Nonnegative Matrix Factorization

[2019-04-20] Latent Semantic Analysis in Clojure

[2019-04-13] K-Nearest Neighbors Algorithm in Clojure

[2019-04-06] K-Means Implemented in Clojure

[2019-03-19] Prüfer Encoding/Decoding of a Tree in Common Lisp

[2019-03-05] Gale-Shaply Algorithm in Common Lisp

[2019-03-02] Calculating The Correct Rank of a Matrix

[2018-12-04] Feed-forward and back-propagation in neural networks as left- and right-fold

[2018-10-31] Nonnegative Matrix Decomposition in Clojure

[2018-10-30] Non-negative Matrix Decomposition in Scala

[2018-08-30] Working with European Central Bank Data in Scala

[2018-07-30] Perverse Sequences

[2018-05-28] Online Perceptron in Common Lisp

[2018-05-26] Online Perceptron

[2018-05-18] Online Regression

[2018-05-06] Knut’s Algorithm-S in Common Lisp

[2018-02-28] Irreducible Dyck Words

[2018-02-19] Optimization with GNU Scientific Library for Lisp

[2018-02-10] Van Eck’s Sequence

[2018-02-09] Hiring networks in mathematics

[2018-02-08] Linus Sequence

[2018-02-05] Egyptian Fractions

[2018-02-01] Listing all Young Tableaux

[2018-01-23] Collatz sequence (yet again)

[2018-01-15] Hofstadter's Q sequence

[2018-01-09] Farey Sequence

[2018-01-09] Catalan's Triangle

[2018-01-06] The Shoelace Formula for the Area of a Polygon

[2017-10-01] Working with European Central Bank Data in Python

[2017-09-27] Expected Value of the Diameter of a Tree

[2017-09-26] Using Quandl with kixi.stats on Clojure

[2017-09-22] Using Quandl with Common Lisp

[2017-08-05] Solving Linear Equations in Natural Numbers

[2017-07-31] Transitive Closure of a Directed Graph or a Relation

[2017-07-20] Steenrod-Milnor and Tournament Sequences

[2017-07-15] A lower bound on the radius of a graph

[2017-07-08] All partitions of an integer

[2017-07-06] Some Hasse Diagrams

[2017-07-04] Shuffles

[2017-07-03] Kaprekar Sequence

[2017-07-01] Lattice of Dyck Words

[2017-06-28] The poset of connected subgraphs of a connected graph

[2017-06-21] Calculating the Diameter and the Radius of a Graph Using Tropic Linear Algebra

[2017-06-19] Generating random regular graphs

[2017-06-14] Estimating the maximum element of a large list

[2017-06-09] A Stochastic Gradient Descent Implementation in Clojure

[2017-06-06] A topology problem

[2017-04-22] Listing duplicate files

[2017-03-14] My First Idris Proof

[2016-12-02] Distinguishing hash functions (part II)

[2016-12-01] Distinguishing hash functions

[2016-10-20] Hofstadter-Conway $10,000 sequence

[2016-08-18] A Solution for Problem 171 of 4Clojure

[2016-08-16] Puzzles and Group Theory

[2016-08-13] Using Weka within Lisp

[2016-07-12] Funniest and Unfunniest Jokes in the Jester Dataset

[2016-07-05] Generating Uniformly Random Connected Graphs

[2016-06-16] The Robinson-Schensted Algorithm

[2016-06-01] Conjugate Partitions

[2016-04-27] Using Word2Vec from Clojure

[2016-04-24] Using Word2Vec from Common Lisp

[2016-04-18] A Migration Analysis

[2016-04-11] Basic Data Analysis with CL without Frameworks

[2016-03-25] Parallel map-reduce in Common Lisp

[2016-02-22] Text Summarization and Topic Analysis

[2016-01-27] Set Covering Problem

[2016-01-25] Kolmogorov-Smirnov Test

[2016-01-20] Eigen-values of the Laplacian and Connected Components of a Graph

[2015-12-12] Dual Graphs

[2015-10-26] Longest Increasing Subsequence Revisited

[2015-10-16] Document Summarization via Markov Chains

[2015-10-07] Computational Literary Analysis

[2015-09-30] Library of Babel in Common Lisp

[2015-09-28] Merging Association Lists in Common Lisp

[2015-07-22] Cheapest Paths via Tropic Matrices

[2015-07-21] Hidden Markov Models via Tropic Matrices

[2015-07-08] A non-technical post

[2015-06-28] An implementation of the Viterbi algorithm in Common Lisp

[2015-05-28] Greatest Common Divisor of Two Rational Numbers

[2015-05-21] Partitions of Equal Measure Whatever the Measure May Be

[2015-05-14] Finding Cliques in a Graph

[2015-05-12] Set Cover Problem

[2015-05-03] Threading Macros in Common Lisp

[2015-05-03] Happy Numbers

[2015-05-01] Collatz Primes

[2015-04-23] Splitting Streams

[2015-04-06] Hamming Distance and Double Hashing

[2015-04-05] Hamming Distance and Hashing Functions

[2015-04-05] Hamming Derivative of Hashing Functions

[2015-04-02] A Topology Problem

[2015-03-21] Curve Fitting is a Gram-Schmidt Reduction

[2015-03-08] Maximum number of characters using keystrokes A, Ctrl+A, Ctrl+C and Ctrl+V

[2015-03-06] Eccentricity, Radius and Diameter in a Graph, Revisited

[2015-03-01] Graphs and Entropy

[2015-02-22] Math PhD Hiring Network (Part 3)

[2015-02-19] Math PhD Hiring Network (Part 2)

[2015-02-18] Math PhD Hiring Network (Part 1)

[2015-02-17] Faculty Networks and Inequality in Hiring Practices in Universities

[2015-02-10] Functional Streams in Lisp Explained

[2015-02-05] Collatz-type Conjectures (Continued)

[2015-02-04] Collatz-type Conjectures (Continued)

[2015-01-31] Collatz-type Conjectures (Continued)

[2015-01-30] Collatz-type Conjectures

[2015-01-28] Experiments with Infinite Recursive Sequences (continued)

[2015-01-17] Experiments with Infinite Recursive Sequences

[2015-01-10] Goldbach Pairs

[2015-01-02] Collatz Lengths (Continued)

[2015-01-01] Functional Streams

[2014-12-27] Polarization in the US Congress

[2014-12-18] Partition a sequence

[2014-11-28] Uniformly Random Permutations

[2014-11-22] An Implementation of Ford-Fulkerson Algorithm in Common Lisp

[2014-11-17] Tropic Calculation of Cheapest Paths

[2014-11-05] Longest common subsequence of two sequences

[2014-10-30] Counting Spanning Trees of a Graph

[2014-10-26] Longest Increasing Subsequence

[2014-10-24] The Number of Inversions in a Sequence

[2014-10-22] Hashes and Entropy

[2014-10-09] Estimating Cardinality with Constant Memory Complexity

[2014-09-30] Landau's Function

[2014-09-29] A Problem on Substitution Ciphers and Group Theory

[2014-09-28] A Morse Code Translator

[2014-09-23] A Memoization Macro for Common Lisp

[2014-09-21] Reducers are Monoid Morphisms

[2014-09-18] Number of isomorphism classes of binary trees

[2014-09-07] CONS is your friend

[2014-08-22] A Zipf's Law Simulation

[2014-08-07] Generating Uniformly Random Trees

[2014-07-09] A Solution for Project Euler #463

[2014-06-12] Entropy of truncated MD5 hashing

[2014-06-08] Hexadecimal digits of π

[2014-02-11] Information content of n-grams

[2014-02-08] Turkish Sentiment Analysis Using Thesaurus Distance

[2014-02-01] Sentiment analysis using word distances

[2014-01-27] Phase transitions in entropy

[2013-12-13] Optimal length of n-grams

[2013-12-10] Counting strings that contain intervals of same letter repetitions

[2013-12-02] Patterns Separating Large Texts

[2013-11-23] Collatz Sequences (Continued)

[2013-11-11] Entropy and approximately one-to-one maps

[2013-10-23] Tree Isomorphism

[2013-10-15] Self Organizing Maps

[2013-09-15] Euler Project #401

[2013-09-15] An additively recursive definition of the Moebius function

[2013-09-11] An Unsuccessful Attempt for Solving Euler Project #401

[2013-09-04] Uniform Sampling from Parametrized Submanifolds in Scala

[2013-09-04] Uniform Sampling from Parametrized Submanifolds

[2013-08-30] Randomly Generated Points Obeying a Distribution

[2013-08-25] Simulated Annealing in Lisp

[2013-08-21] Eigenvalues and Eigenvectors in GSLL

[2013-08-16] Reservoir Sampling

[2013-08-11] Gibbs sampling in lisp compared with C

[2013-08-10] Logistic Regression in lisp

[2013-08-10] Linear Discriminant Analysis in R

[2013-07-17] A Gradient Descent Implementation in Lisp

[2013-07-01] k-Nearest Neighbor Classification Algorithm Implemented in Lisp

[2013-05-19] Newton-Raphson Method

[2013-05-07] Levenshtein Distance

[2013-04-15] Cut points in a graph

[2013-04-01] Experiments on Collatz Lengths (Continued)

[2013-02-18] The sound of the torsion parts of homotopy groups of spheres

[2013-02-12] Monadic Units

[2013-02-07] Distribution of Collatz Lengths (continued)

[2013-02-03] Distribution of Collatz Lengths

[2013-01-31] Quotients of polynomial algebras

[2013-01-12] Path ideals

[2013-01-10] McCarthy91 Terminates

[2013-01-09] Finding all paths in a directed graph

[2013-01-04] A Simple Monte-Carlo Integration Implementation in Lisp

[2012-12-30] A simple problem in Kolmogorov-Chaitin complexity

[2012-12-29] From walks to paths

[2012-12-16] Higher order functions, functors and monads

[2012-12-13] Eccentricity, Radius and Diameter in an Undirected Graph

[2012-11-29] Untitled

[2012-11-25] Strictly Increasing Labels of Directed Graphs

[2012-11-19] Strictly Increasing Labellings of Directed Graphs

[2012-11-17] Nilpotent elements in an artinian algebra

[2012-11-04] Local rings, idempotents and non-invertible elements

[2012-10-18] An implementation of the fixed-radius near neighbor clustering algorithm in lisp

[2012-10-15] Reducing directed graphs

[2012-10-10] An implementation of the k-means clustering algorithm in lisp

[2012-10-08] A comparison of different map functions in lisp

[2012-10-03] Source code entropy

[2012-09-28] Collisions in random walks

[2012-09-26] Transitive closure of a directed graph

[2012-09-26] Solving linear equations in ℕ

[2012-09-26] Listing partitions

[2012-09-26] Inverting formal power series

[2012-09-26] Hasse subgraph of a directed graph