The \(k\)-means clustering algorithm is probably one of the simplest clustering algorithms available. The set-up is simple: we have a metric space \((X,d)\) and a collection of points \(D = \{x_1,\ldots, x_n\}\) in \(X\) that we would like to write as a disjoint union of \(k\) subsets \[ D = D_1 \sqcup \cdots \sqcup D_k \] In most applications the data is numerical, and therefore we can assume that \(X\) is a subset of a finite dimensional vector space over \({\mathbb R}\). But if the data is categorical and cannot easily be ranked or ordered, then we really must work with an abstract metric space. Our algorithm works as follows:
1. Select an initial set of centers \(y^1_1,\ldots, y^1_k\) for our clusters \(D_1,\ldots, D_k\). The selection can be made randomly or can be given by hand.
2. Assuming we have these centers \(y^m_1,\ldots,y^m_k\) at step \(m\), go over every point in our data set \(D\). Postulate that \(x\in D^m_i\) if the distance of \(x\) to \(y^m_i\) is minimal among all such centers. In other words, \(x\in D^m_i\) if and only if \[ d(x,y^m_i) = \min\{ d(x,y^m_j)\ |\ j = 1,\ldots, k\} \]
3. Construct new centers \(y^{m+1}_i\) from the cluster subsets \(D^m_i\) we formed in the previous step. If \(X\) is a subset of a vector space, then the new center is simply the average of all vectors in \(D^m_i\) (see the formula right after this list).
4. Repeat steps 2 and 3 until \(y^m_i = y^{m+1}_i\) for every \(i=1,\ldots,k\).
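For reference, when \(X\) is a subset of \({\mathbb R}^d\) the update in step 3 is just the coordinate-wise mean \[ y^{m+1}_i = \frac{1}{|D^m_i|} \sum_{x\in D^m_i} x \] For a genuinely abstract metric space one has to choose a suitable notion of center by hand, which is why the cluster-center function below is kept specific to the data at hand.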
First, I need to write a few utility functions.
(defun range (N) (loop for i from 0 below N collect i))
(defun select-column (collection column)
(map 'vector (lambda (x) (aref x column)) collection))
SELECT-COLUMN
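Here is a quick sanity check of these two helpers on some made-up inputs (the little vectors below are purely illustrative and have nothing to do with the data set):
(range 4)
;; => (0 1 2 3)
(select-column (list #(1 2 3) #(4 5 6)) 1)
;; => #(2 5)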
I will assume that the data is given as a collection of comma-separated entries, one data point per line. I would like to write a utility function so that I can select the specific columns of data I would like to consider. I will also need the regular expression library cl-ppcre.
(require :cl-ppcre)
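In case your implementation's require cannot locate the library, loading it through Quicklisp (assuming Quicklisp is installed) works just as well:
(ql:quickload "cl-ppcre")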
The next utility function will take the name of the data file as its argument and load the data. It will return a list of entries.
(defun read-data (name)
(with-open-file (file name)
(let (line)
(loop while (setf line (read-line file nil)) collect
(if (> (length line) 0)
(map 'vector 'read-from-string (ppcre:split #\, line)))))))
Let us test the function on some sample data. The data comes from the UCI Machine Learning Repository.
(setq data (read-data "iris.csv"))
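One small caveat: read-data collects a NIL entry for every blank line in the file, so if your copy of the file ends with an empty line you may want to drop those entries before moving on. This is optional; nothing below depends on it.
(setq data (remove nil data))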
Our algorithm will need two fundamental functions, both of which should be specific to the data at hand. The first is the distance function. Both the distance and the cluster-center functions we are going to define must depend on the structure of the data points. For this data set, the \(L^1\)-distance (also known as the taxi-cab distance) is going to be our distance function.
(defun distance (x y)
  (let ((N (length x)))
    ;; the last entry of each data point is its class label, so skip it
    (loop for i from 0 below (1- N) sum (abs (- (aref x i) (aref y i))))))
Let us test the function on two points from our data set:
(setq NUM (length data))
(setq point-1 (nth (random NUM) data))
(setq point-2 (nth (random NUM) data))
(format nil "Distance between ~A and ~A is ~A~%" point-1 point-2 (distance point-1 point-2))
Distance between #(5.4 3.9 1.7 0.4 IRIS-SETOSA) and
#(6.8 3.0 5.5 2.1 IRIS-VIRGINICA) is 7.8
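Note that the class label sitting in the last slot never enters the sum: the loop inside distance stops one coordinate short of the end. A throwaway check on two made-up points (these are hypothetical, not from the data set) makes this visible:
(distance (vector 1 2 3 4 'foo) (vector 2 4 6 8 'bar))
;; => 10, that is 1 + 2 + 3 + 4; the labels FOO and BAR are ignored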
The center of a collection of points is the point obtained by taking the average of each coordinate.
(defun cluster-center (collection)
(let* ((N (length collection))
(m (length (car collection)))
(res (make-array m)))
(map-into res (lambda (x) (/ (reduce '+ (select-column collection x)) N)) (range (1- m)))
res))
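Before running it on the real data, here is what it does on two hypothetical points. The numeric coordinates get averaged, while the last slot, which is reserved for the class label, is never written to; it stays at whatever make-array initialized it to (NIL or 0, depending on the implementation), which is why a NIL shows up at the end of the centers computed below.
(cluster-center (list (vector 1.0 2.0 'foo) (vector 3.0 6.0 'bar)))
;; => #(2.0 4.0 NIL) here; the last slot is simply left untouched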
Let us test this function on the whole dataset to find the center.
(setq data-center (cluster-center data))
#(5.8433347 3.054 3.7586665 1.1986669 NIL)
Now, the following function which-cluster decides which cluster a given data point belongs to. It takes two arguments: a data point x and a set of \(k\) data points designated as cluster centers, centers.
(defun which-cluster (x centers)
(let* ((N (length centers))
(res (mapcar 'cons
(coerce (map 'vector (lambda (u) (distance x u)) centers) 'list)
(range N))))
(cdar (sort res (lambda (a b) (< (car a) (car b)))))))
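A small smoke test with a made-up point and two made-up centers (none of these vectors come from the data set): the point is much closer to the first center, so the function returns the index 0.
(which-cluster (vector 1.0 1.0 'foo)
               (vector (vector 0.0 0.0 nil) (vector 5.0 5.0 nil)))
;; => 0, since the L^1 distances to the two centers are 2.0 and 8.0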
The next function k-means-iterate takes two arguments: a training data set training-data and a set of \(k\) data points serving as the current centers of the clusters we would like to form. It returns the next set of centers for the clusters.
(defun k-means-iterate (training-data centers)
  ;; each slot of CLUSTERS starts out as an empty list of points
  (let ((clusters (make-array (length centers) :initial-element nil)))
    (dolist (y training-data)
      (push y (aref clusters (which-cluster y centers))))
    (map 'vector 'cluster-center clusters)))
K-MEANS-ITERATE
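To see a single iteration in isolation, here is a toy run on four hypothetical points forming two obvious groups, with two deliberately off-center initial guesses. The returned centers move to the middle of each group; as before, the last slot of each center is just whatever make-array left there.
(k-means-iterate (list (vector 0.0 0.0 'a) (vector 0.5 0.0 'a)
                       (vector 5.0 5.0 'b) (vector 5.5 5.0 'b))
                 (vector (vector 1.0 0.0 nil) (vector 4.0 5.0 nil)))
;; => #(#(0.25 0.0 NIL) #(5.25 5.0 NIL)) on my Lisp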
One has to be careful here. It is possible that k-means-iterate returns fewer than \(k\) usable centers even if it is given exactly \(k\) centers: if one of the clusters ends up empty, its center comes back degenerate. In that case the iteration will fail ungracefully on the next run. You are warned. I might add extra code to deal with this boundary case, but I would rather re-run the program until I get exactly \(k\) centers.
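Purely as an illustration of what such extra code could look like (check-centers below is my own hypothetical helper and is not used anywhere else in this post), one could fail loudly right after an iteration instead of obscurely inside distance on the next one:
(defun check-centers (centers)
  ;; an empty cluster produces an empty vector as its center
  (if (every (lambda (c) (plusp (length c))) centers)
      centers
      (error "A cluster came back empty; re-run with a different initial selection of centers.")))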
And finally, we write the training function. This will take a training data set and iterate k-means-iterate until the center points of our clusters stabilize, that is, until the total distance between consecutive sets of centers drops below a small error bound.
(defun k-means-train (training-data centers &optional (error 0.003))
  (let ((N (length centers)))
    (do* ((v centers u)
          (u (k-means-iterate training-data v)
             (k-means-iterate training-data v)))
         ((< (loop for i from 0 below N
                   sum (distance (aref u i) (aref v i)))
             error)
          u))))
K-MEANS-TRAIN
I will run the whole data set through the algorithm. I will use three centers; that is, I would like to split the data set into 3 clusters.
(setq centers
(let* ((copy (coerce data 'vector))
(NUM (length copy)))
(map 'vector (lambda (x) (aref copy (random NUM))) #(1 2 3))))
#(#(6.4 2.9 4.3 1.3 IRIS-VERSICOLOR) #(6.0 2.2 5.0 1.5 IRIS-VIRGINICA)
#(4.6 3.2 1.4 0.2 IRIS-SETOSA))
Now, I will train the algorithm and it will give us the centers for the clusters. Then, using these cluster centers, I can figure out which cluster any given point belongs to with the which-cluster function. I am going to use the error value 0.004, which is completely arbitrary. You are free not to pass an error bound; in that case the default value 0.003 will be used.
(setf class-centers (k-means-train data centers 0.004))
#(#(5.006 3.418 1.4640001 0.24399994 NIL)
#(6.87027 3.0864863 5.7459464 2.0891888 NIL)
#(5.9047627 2.7460315 4.4126983 1.4333336 NIL))
Let us see how successful the algorithm was:
(setq N (length data-center))
(setq results (loop for x in data collect (list (aref x (1- N))
(which-cluster x class-centers))))
(setq report nil)
(dolist (x results)
(if (assoc x report :test #'equal)
(incf (cdr (assoc x report :test #'equal)))
(push (cons x 1) report)))
report
(((IRIS-VIRGINICA 2) . 15) ((IRIS-VIRGINICA 0) . 35)
((IRIS-VERSICOLOR 0) . 2) ((IRIS-VERSICOLOR 2) . 48)
((IRIS-SETOSA 1) . 50))
Consider the following 3x3 matrix where we display the way our clusters are formed.
| Species | Cluster 0 | Cluster 1 | Cluster 2 |
|---|---|---|---|
| Versicolor | 2 | 0 | 48 |
| Virginica | 35 | 0 | 15 |
| Setosa | 0 | 50 | 0 |
The results indicate that if the return value of the which-cluster function for a data point \(x\) is 1, using the centers provided by the k-means-train function, then the point \(x\) is very likely to be an Iris Setosa. If the function returns 2 for \(x\), then \(x\) is very likely to be an Iris Versicolor or an Iris Virginica; moreover, the odds of it being an Iris Versicolor are approximately 3 times higher than the odds of \(x\) being an Iris Virginica. If the function returns 0, then \(x\) is very likely to be an Iris Virginica.
Ok. We are done with this data set. I am going to use another data set from the UCI Machine Learning Repository. I am going to be lazy: instead of writing new distance and cluster-center functions specific to this new data set, I am going to modify the new data so that the last column is now the class ascribed to the data point on each row.
(setq data (read-data "wine.csv"))
Ok. Let us check the distance function:
(setq NUM (length data))
(setq point-1 (nth (random NUM) data))
(setq point-2 (nth (random NUM) data))
(format nil "Distance between ~A and ~A is ~A~%" point-1 point-2 (distance point-1 point-2))
Distance between #(12.77 2.39 2.28 19.5 86 1.39 0.51 0.48 0.64
9.899999 0.57 1.63 470 3) and
#(12.08 1.83 2.32 18.5 81 1.6 1.5 0.52 1.64 2.4 1.08 2.27 480 2) is 28.179998
and test the cluster-center function on the whole data set:
(setf data-center (cluster-center data))
#(13.000614 2.336348 2.3665185 19.494946 8877/89 2.2951121 2.02927
0.36185396 1.5908992 5.0580897 0.95744956 2.6116843 132947/178 NIL)
As before, I need 3 clusters. I will again recycle the code above for setting the first 3 cluster centers randomly:
(setq centers
(let* ((copy (coerce data 'vector))
(NUM (length copy)))
(map 'vector (lambda (x) (aref copy (random NUM))) #(1 2 3))))
#(#(13.03 0.9 1.71 16 86 1.95 2.03 0.24 1.46 4.6 1.19 2.48 392 2)
#(12.37 0.94 1.36 10.6 88 1.98 0.57 0.28 0.42 1.95 1.05 1.82 520 2)
#(12.37 1.63 2.3 24.5 88 2.22 2.45 0.4 1.9 2.12 0.89 2.78 342 2))
Now, let us run k-means-train:
(setq class-centers (k-means-train data centers 0.004))
#(#(12.51191 2.4873528 2.2838235 20.776472 6271/68 2.0670583 1.7754412
0.38808835 1.4613234 4.074706 0.9419118 2.4957347 7757/17 NIL)
#(13.804467 1.8834045 2.42617 17.023403 4959/47 2.867234 3.014256
0.28531918 1.9104254 5.702553 1.0782979 3.1140423 56172/47 NIL)
#(12.9284115 2.5112698 2.4112694 19.955551 932/9 2.114444 1.5684129
0.39063492 1.4923812 5.6387305 0.8840635 2.3620634 5083/7 NIL))
Let us see how successful the algorithm was:
(setq N (length data-center))
(setq results (loop for x in data collect (list (aref x (1- N)) (which-cluster x class-centers))))
(setq report nil)
(dolist (x results)
(if (assoc x report :test #'equal)
(incf (cdr (assoc x report :test #'equal)))
(push (cons x 1) report)))
report
(((3 1) . 18) ((3 0) . 30) ((2 2) . 1) ((2 0) . 20) ((2 1) . 50)
((1 0) . 13) ((1 2) . 46))
Consider the following 3x3 matrix where we display the way our clusters are formed.
| Class | Cluster 0 | Cluster 1 | Cluster 2 |
|---|---|---|---|
| Class 1 | 13 | 0 | 46 |
| Class 2 | 20 | 50 | 1 |
| Class 3 | 30 | 18 | 0 |
The table tells us the following. If the function which-cluster returns 2 for a data point \(x\), using the centers we obtained from k-means-train, then it is likely that the data point \(x\) comes from a Class 1 wine. If the return value is 1, then \(x\) most likely comes from a Class 2 or a Class 3 wine; moreover, the odds of \(x\) being a Class 2 wine as opposed to a Class 3 wine are approximately 5 to 2. If the function which-cluster returns 0, the return value is not going to be of much use.