The Kitchen Sink and Other Oddities

Atabey Kaygun

Levenshtein Distance

Description of the problem

Levenshtein Distance also known as the edit distance is a metric on the set of all words over a fixed alphabet. The Levenshtein distance between two words \(\omega_1\) and \(\omega_2\) is the minimal number of single character edits one has to make to convert \(\omega_1\) to \(\omega_2\), or vice versa. The distance formula given in the Wikipedia article above has a very nice recursive implementation. The implementation I will give below is partly inspired by a post on the Racket Blog on memoization vs. dynamic programming. The implementation I give below is memoized. For a dynamic programming approach you may look at the implementation over at the Rosetta Code.

A Straightforward Implementation

I will not spend to much time explaining the implementation. The algorithm is very straightforward. The main difference between the implementation at the Racket Blog and the implementation below is that below I use lisp strings as arguments while the implementation I cited above uses lists.

(defun levenshtein (a b)
   (cond ((zerop (length a)) (length b))
         ((zerop (length b)) (length a))
         (t (if (equal (char a 0) (char b 0))
               (levenshtein (subseq a 1) (subseq b 1))
               (1+ (min (levenshtein (subseq a 1) b)
                        (levenshtein a (subseq b 1))
                        (levenshtein (subseq a 1) (subseq b 1))))))))
LEVENSHTEIN

Now let me test the code on an example:

(time (levenshtein "hahahahhahhahhhaa" "haahahahhahahahahhhaa"))
4

Evaluation took:
 0.004 seconds of real time
 0.004000 seconds of total run time (0.004000 user, 0.000000 system)
 100.00% CPU
 10,271,520 processor cycles
 1,544,856 bytes consed

An Implementation with Memoization

Now, below I implement the same function with memoization:

(defparameter table-levenshtein 
   (make-hash-table :test 'equal))

(defun lookup-levenshtein (a b)
   (gethash (list a b) table-levenshtein))

(defun push-levenshtein (a b val)
   (setf (gethash (list a b) table-levenshtein) val))

(defun levenshtein (a b)
   (cond ((zerop (length a)) (length b))
         ((zerop (length b)) (length a))
         (t (or (lookup-levenshtein a b)
                (push-levenshtein a b 
                   (if (equal (char a 0) (char b 0))
                      (levenshtein (subseq a 1) (subseq b 1))
                      (1+ (min (levenshtein (subseq a 1) b)
                               (levenshtein a (subseq b 1))
                               (levenshtein (subseq a 1) (subseq b 1))))))))))
LEVENSHTEIN

And here is the same test above:

(time (levenshtein "hahahahhahhahhhaa" "haahahahhahahahahhhaa"))
4

Evaluation took:
 0.000 seconds of real time
 0.000000 seconds of total run time (0.000000 user, 0.000000 system)
 100.00% CPU
 239,724 processor cycles
 16,240 bytes consed

Another Implementation in Scala

I have been looking into learning some Scala lately, and this problem seems to be a good way to write some scala code. So, here is the code without the memoization feature:

object Main {
   def levenshtein(a:String, b:String):Int = {
       if( a.length == 0 ) 
         b.length
       else if( b.length == 0 )
         a.length
       else if( a(0) == b(0) )
         levenshtein( a.substring(1), b.substring(1) )
       else 
         1 + (List (levenshtein(a,b.substring(1)),
                    levenshtein(a.substring(1),b),
                    levenshtein(a.substring(1),b.substring(1))) min)
   }

   def main(args:Array[String]) {
       println("The Levenshtein distance between "+args(0)+" and "+args(1)+" is "+levenshtein(args(0),args(1)))
   }
}