Levenshtein Distance also known as the edit distance is a metric on the set of all words over a fixed alphabet. The Levenshtein distance between two words \(\omega_1\) and \(\omega_2\) is the minimal number of single character edits one has to make to convert \(\omega_1\) to \(\omega_2\), or vice versa. The distance formula given in the Wikipedia article above has a very nice recursive implementation. The implementation I will give below is partly inspired by a post on the Racket Blog on memoization vs. dynamic programming. The implementation I give below is memoized. For a dynamic programming approach you may look at the implementation over at the Rosetta Code.
I will not spend to much time explaining the implementation. The algorithm is very straightforward. The main difference between the implementation at the Racket Blog and the implementation below is that below I use lisp strings as arguments while the implementation I cited above uses lists.
(defun levenshtein (a b)
(cond ((zerop (length a)) (length b))
((zerop (length b)) (length a))
(t (if (equal (char a 0) (char b 0))
(levenshtein (subseq a 1) (subseq b 1))
(1+ (min (levenshtein (subseq a 1) b)
(levenshtein a (subseq b 1))
(levenshtein (subseq a 1) (subseq b 1))))))))
LEVENSHTEIN
Now let me test the code on an example:
(time (levenshtein "hahahahhahhahhhaa" "haahahahhahahahahhhaa"))
4
Evaluation took:
0.004 seconds of real time
0.004000 seconds of total run time (0.004000 user, 0.000000 system)
100.00% CPU
10,271,520 processor cycles
1,544,856 bytes consed
Now, below I implement the same function with memoization:
(defparameter table-levenshtein
(make-hash-table :test 'equal))
(defun lookup-levenshtein (a b)
(gethash (list a b) table-levenshtein))
(defun push-levenshtein (a b val)
(setf (gethash (list a b) table-levenshtein) val))
(defun levenshtein (a b)
(cond ((zerop (length a)) (length b))
((zerop (length b)) (length a))
(t (or (lookup-levenshtein a b)
(push-levenshtein a b
(if (equal (char a 0) (char b 0))
(levenshtein (subseq a 1) (subseq b 1))
(1+ (min (levenshtein (subseq a 1) b)
(levenshtein a (subseq b 1))
(levenshtein (subseq a 1) (subseq b 1))))))))))
LEVENSHTEIN
And here is the same test above:
(time (levenshtein "hahahahhahhahhhaa" "haahahahhahahahahhhaa"))
4
Evaluation took:
0.000 seconds of real time
0.000000 seconds of total run time (0.000000 user, 0.000000 system)
100.00% CPU
239,724 processor cycles
16,240 bytes consed
I have been looking into learning some Scala lately, and this problem seems to be a good way to write some scala code. So, here is the code without the memoization feature:
object Main {
def levenshtein(a:String, b:String):Int = {
if( a.length == 0 )
b.length
else if( b.length == 0 )
a.length
else if( a(0) == b(0) )
levenshtein( a.substring(1), b.substring(1) )
else
1 + (List (levenshtein(a,b.substring(1)),
levenshtein(a.substring(1),b),
levenshtein(a.substring(1),b.substring(1))) min)
}
def main(args:Array[String]) {
println("The Levenshtein distance between "+args(0)+" and "+args(1)+" is "+levenshtein(args(0),args(1)))
}
}