The \(k\)-means clustering algorithm is probably one of the simplest clustering algorithms available. The set-up is simple: we have a metric space \((X,d)\) and a collection of points \(D = \{x_1,\ldots, x_n\}\) in \(X\) that we would like to write as a disjoint union of \(k\) subsets \[ D = D_1 \sqcup \cdots \sqcup D_k \] In most applications the data is numerical, and therefore we can assume that \(X\) is a subset of a finite dimensional vector space over \({\mathbb R}\). But if the data is categorical and cannot easily be ranked or ordered, then we really must work with an abstract metric space. Our algorithm works as follows:
1. Select an initial set of centers \(y^1_1,\ldots, y^1_k\) for our clusters \(D_1,\ldots, D_k\). The selection can be made randomly or can be given by hand.
2. Assuming we have these centers \(y^m_1,\ldots,y^m_k\) at step \(m\), go over every point in our data set \(D\). Postulate that \(x\in D^m_i\) if the distance of \(x\) to \(y^m_i\) is minimal among all such centers. In other words, \(x\in D^m_i\) if and only if \[ d(x,y^m_i) = \min\{ d(x,y^m_j)\ |\ j = 1,\ldots, k\} \]
3. Construct new centers \(y^{m+1}_i\) from the cluster subsets \(D^m_i\) we formed in the previous step. If \(X\) is a subset of a vector space, then the new center is simply the average of all vectors in \(D^m_i\) (see the formula right after this list).
4. Repeat steps 2 and 3 until \(y^m_i = y^{m+1}_i\) for every \(i=1,\ldots,k\).
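For reference, when \(X\) is a subset of \({\mathbb R}^d\) the update in step 3 is just the coordinate-wise mean \[ y^{m+1}_i = \frac{1}{|D^m_i|} \sum_{x\in D^m_i} x \] For a genuinely abstract metric space one has to choose a suitable notion of center by hand, which is why the cluster-center function below is kept specific to the data at hand.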
First, I need to write a few utility functions.
(defun range (N) (loop for i from 0 below N collect i))
(defun select-column (collection column)
(map 'vector (lambda (x) (aref x column)) collection))
SELECT-COLUMN
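Here is a quick sanity check of these two helpers on some made-up inputs (the little vectors below are purely illustrative and have nothing to do with the data set):
(range 4)
;; => (0 1 2 3)
(select-column (list #(1 2 3) #(4 5 6)) 1)
;; => #(2 5)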
I will assume that the data is given as a collection of comma-separated entries, one data point per line. I would like to write a utility function so that I can select the specific columns of data I would like to consider. I will also need the regular expression library cl-ppcre.
(require :cl-ppcre)
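In case your implementation's require cannot locate the library, loading it through Quicklisp (assuming Quicklisp is installed) works just as well:
(ql:quickload "cl-ppcre")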
The next utility function will take the name of the data file as its argument and load the data. It will return a list of entries.
(defun read-data (name)
(with-open-file (file name)
(let (line)
(loop while (setf line (read-line file nil)) collect
(if (> (length line) 0)
(map 'vector 'read-from-string (ppcre:split #\, line)))))))
Let us test the function on some sample data. The data comes from the UCI Machine Learning Repository.
(setq data (read-data "iris.csv"))
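One small caveat: read-data collects a NIL entry for every blank line in the file, so if your copy of the file ends with an empty line you may want to drop those entries before moving on. This is optional; nothing below depends on it.
(setq data (remove nil data))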
Our algorithm will need two fundamental functions, both of which should be specific to the data at hand. The first is the distance function. Both the distance and the cluster-center functions we are going to define must depend on the structure of the data points. For this data set, the \(L^1\)-distance (also known as the taxi-cab distance) is going to be our distance function.
(defun distance (x y)
  (let ((N (length x)))
    ;; the last entry of each data point is its class label, so skip it
    (loop for i from 0 below (1- N) sum (abs (- (aref x i) (aref y i))))))
Let us test the function on two points from our data set:
(setq NUM (length data))
(setq point-1 (nth (random NUM) data))
(setq point-2 (nth (random NUM) data))
(format nil "Distance between ~A and ~A is ~A~%" point-1 point-2 (distance point-1 point-2))
Distance between #(5.4 3.9 1.7 0.4 IRIS-SETOSA) and
#(6.8 3.0 5.5 2.1 IRIS-VIRGINICA) is 7.8
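Note that the class label sitting in the last slot never enters the sum: the loop inside distance stops one coordinate short of the end. A throwaway check on two made-up points (these are hypothetical, not from the data set) makes this visible:
(distance (vector 1 2 3 4 'foo) (vector 2 4 6 8 'bar))
;; => 10, that is 1 + 2 + 3 + 4; the labels FOO and BAR are ignored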
The center of a collection of points is the point obtained by taking the average of each coordinate.
(defun cluster-center (collection)
(let* ((N (length collection))
(m (length (car collection)))
(res (make-array m)))
(map-into res (lambda (x) (/ (reduce '+ (select-column collection x)) N)) (range (1- m)))
res))
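Before running it on the real data, here is what it does on two hypothetical points. The numeric coordinates get averaged, while the last slot, which is reserved for the class label, is never written to; it stays at whatever make-array initialized it to (NIL or 0, depending on the implementation), which is why a NIL shows up at the end of the centers computed below.
(cluster-center (list (vector 1.0 2.0 'foo) (vector 3.0 6.0 'bar)))
;; => #(2.0 4.0 NIL) here; the last slot is simply left untouched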
Let us test this function on the whole dataset to find the center.
(setq data-center (cluster-center data))
#(5.8433347 3.054 3.7586665 1.1986669 NIL)
Now, the following function which-cluster decides which cluster a given data point belongs to. It takes two arguments: a data point x and a set of \(k\) data points designated as cluster centers, centers.
(defun which-cluster (x centers)
(let* ((N (length centers))
(res (mapcar 'cons
(coerce (map 'vector (lambda (u) (distance x u)) centers) 'list)
(range N))))
(cdar (sort res (lambda (a b) (< (car a) (car b)))))))
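A small smoke test with a made-up point and two made-up centers (none of these vectors come from the data set): the point is much closer to the first center, so the function returns the index 0.
(which-cluster (vector 1.0 1.0 'foo)
               (vector (vector 0.0 0.0 nil) (vector 5.0 5.0 nil)))
;; => 0, since the L^1 distances to the two centers are 2.0 and 8.0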
The next function k-means-iterate takes two arguments: a training data set training-data and a set of \(k\) data points serving as the current centers of the clusters we would like to form. It returns the next set of centers for the clusters.
(defun k-means-iterate (training-data centers)
  ;; each slot of CLUSTERS starts out as an empty list of points
  (let ((clusters (make-array (length centers) :initial-element nil)))
    (dolist (y training-data)
      (push y (aref clusters (which-cluster y centers))))
    (map 'vector 'cluster-center clusters)))
K-MEANS-ITERATE
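To see a single iteration in isolation, here is a toy run on four hypothetical points forming two obvious groups, with two deliberately off-center initial guesses. The returned centers move to the middle of each group; as before, the last slot of each center is just whatever make-array left there.
(k-means-iterate (list (vector 0.0 0.0 'a) (vector 0.5 0.0 'a)
                       (vector 5.0 5.0 'b) (vector 5.5 5.0 'b))
                 (vector (vector 1.0 0.0 nil) (vector 4.0 5.0 nil)))
;; => #(#(0.25 0.0 NIL) #(5.25 5.0 NIL)) on my Lisp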
One has to be careful here. It is possible that k-means-iterate returns fewer than \(k\) usable centers even if it is given exactly \(k\) centers: if one of the clusters ends up empty, its center comes back degenerate. In that case the iteration will fail ungracefully on the next run. You are warned. I might add extra code to deal with this boundary case, but I would rather re-run the program until I get exactly \(k\) centers.
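Purely as an illustration of what such extra code could look like (check-centers below is my own hypothetical helper and is not used anywhere else in this post), one could fail loudly right after an iteration instead of obscurely inside distance on the next one:
(defun check-centers (centers)
  ;; an empty cluster produces an empty vector as its center
  (if (every (lambda (c) (plusp (length c))) centers)
      centers
      (error "A cluster came back empty; re-run with a different initial selection of centers.")))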
And finally, we write the training function. This will take a training data set and iterate k-means-iterate until the center points of our clusters stabilize, that is, until the total distance between consecutive sets of centers drops below a small error bound.
(defun k-means-train (training-data centers &optional (error 0.003))
  (let ((N (length centers)))
    (do* ((v centers u)
          (u (k-means-iterate training-data v)
             (k-means-iterate training-data v)))
         ((< (loop for i from 0 below N
                   sum (distance (aref u i) (aref v i)))
             error)
          u))))
K-MEANS-TRAIN
I will run the whole data set through the algorithm. I will use three centers; that is, I would like to split the data set into 3 clusters.
(setq centers
(let* ((copy (coerce data 'vector))
(NUM (length copy)))
(map 'vector (lambda (x) (aref copy (random NUM))) #(1 2 3))))
#(#(6.4 2.9 4.3 1.3 IRIS-VERSICOLOR) #(6.0 2.2 5.0 1.5 IRIS-VIRGINICA)
#(4.6 3.2 1.4 0.2 IRIS-SETOSA))
Now, I will train the algorithm and it will give us the centers for the clusters. Then, using these cluster centers, I can figure out which cluster any given point belongs to with the which-cluster function. I am going to use the error value 0.004, which is completely arbitrary. You are free not to pass an error bound; in that case the default value 0.003 will be used.
(setf class-centers (k-means-train data centers 0.004))
#(#(5.006 3.418 1.4640001 0.24399994 NIL)
#(6.87027 3.0864863 5.7459464 2.0891888 NIL)
#(5.9047627 2.7460315 4.4126983 1.4333336 NIL))
Let us see how successful the algorithm was:
(setq N (length data-center))
(setq results (loop for x in data collect (list (aref x (1- N))
(which-cluster x class-centers))))
(setq report nil)
(dolist (x results)
(if (assoc x report :test #'equal)
(incf (cdr (assoc x report :test #'equal)))
(push (cons x 1) report)))
report
(((IRIS-VIRGINICA 2) . 15) ((IRIS-VIRGINICA 0) . 35)
((IRIS-VERSICOLOR 0) . 2) ((IRIS-VERSICOLOR 2) . 48)
((IRIS-SETOSA 1) . 50))
Consider the following 3x3 matrix where we display the way our clusters are formed.
| Species | Cluster 0 | Cluster 1 | Cluster 2 |
|---|---|---|---|
| Versicolor | 2 | 0 | 48 |
| Virginica | 35 | 0 | 15 |
| Setosa | 0 | 50 | 0 |
The results indicate that if the return value of the which-cluster function for a data point \(x\) is 1, using the centers provided by the k-means-train function, then the point \(x\) is very likely to be an Iris Setosa. If the function returns 2 for \(x\), then \(x\) is very likely to be an Iris Versicolor or an Iris Virginica; moreover, the odds of it being an Iris Versicolor are approximately 3 times higher than the odds of \(x\) being an Iris Virginica. If the function returns 0, then \(x\) is very likely to be an Iris Virginica.
Ok. We are done with this data set. I am going to use another data set from the UCI Machine Learning Repository. I am going to be lazy: instead of writing new distance and cluster-center functions specific to this new data set, I am going to modify the new data so that the last column is now the class ascribed to the data point on each row.
(setq data (read-data "wine.csv"))
Ok. Let us check the distance function:
(setq NUM (length data))
(setq point-1 (nth (random NUM) data))
(setq point-2 (nth (random NUM) data))
(format nil "Distance between ~A and ~A is ~A~%" point-1 point-2 (distance point-1 point-2))
Distance between #(12.77 2.39 2.28 19.5 86 1.39 0.51 0.48 0.64
9.899999 0.57 1.63 470 3) and
#(12.08 1.83 2.32 18.5 81 1.6 1.5 0.52 1.64 2.4 1.08 2.27 480 2) is 28.179998
and test the cluster-center function on the whole data set:
(setf data-center (cluster-center data))
#(13.000614 2.336348 2.3665185 19.494946 8877/89 2.2951121 2.02927
0.36185396 1.5908992 5.0580897 0.95744956 2.6116843 132947/178 NIL)
As before, I need 3 clusters. I will again recycle the code above for setting the first 3 cluster centers randomly:
(setq centers
(let* ((copy (coerce data 'vector))
(NUM (length copy)))
(map 'vector (lambda (x) (aref copy (random NUM))) #(1 2 3))))
#(#(13.03 0.9 1.71 16 86 1.95 2.03 0.24 1.46 4.6 1.19 2.48 392 2)
#(12.37 0.94 1.36 10.6 88 1.98 0.57 0.28 0.42 1.95 1.05 1.82 520 2)
#(12.37 1.63 2.3 24.5 88 2.22 2.45 0.4 1.9 2.12 0.89 2.78 342 2))
Now, let us run k-means-train:
(setq class-centers (k-means-train data centers 0.004))
#(#(12.51191 2.4873528 2.2838235 20.776472 6271/68 2.0670583 1.7754412
0.38808835 1.4613234 4.074706 0.9419118 2.4957347 7757/17 NIL)
#(13.804467 1.8834045 2.42617 17.023403 4959/47 2.867234 3.014256
0.28531918 1.9104254 5.702553 1.0782979 3.1140423 56172/47 NIL)
#(12.9284115 2.5112698 2.4112694 19.955551 932/9 2.114444 1.5684129
0.39063492 1.4923812 5.6387305 0.8840635 2.3620634 5083/7 NIL))
Let us see how successful the algorithm was:
(setq N (length data-center))
(setq results (loop for x in data collect (list (aref x (1- N)) (which-cluster x class-centers))))
(setq report nil)
(dolist (x results)
(if (assoc x report :test #'equal)
(incf (cdr (assoc x report :test #'equal)))
(push (cons x 1) report)))
report
(((3 1) . 18) ((3 0) . 30) ((2 2) . 1) ((2 0) . 20) ((2 1) . 50)
((1 0) . 13) ((1 2) . 46))
Consider the following 3x3 matrix where we display the way our clusters are formed.
| Class | Cluster 0 | Cluster 1 | Cluster 2 |
|---|---|---|---|
| Class 1 | 13 | 0 | 46 |
| Class 2 | 20 | 50 | 1 |
| Class 3 | 30 | 18 | 0 |
The table tells us the following. If the function which-cluster returns 2 for a data point \(x\), using the centers we obtained from k-means-train, then it is likely that the data point \(x\) comes from a Class 1 wine. If the return value is 1, then \(x\) most likely comes from a Class 2 or a Class 3 wine; moreover, the odds of \(x\) being a Class 2 wine as opposed to a Class 3 wine are approximately 5 to 2. If the function which-cluster returns 0, the return value is not going to be of much use.