The \(k\)-means clustering algorithm is probably one of the simplest clustering algorithms available. The set-up is simple: we have a metric space \((X,d)\) and a collection of points \(D = \{x_1,\ldots, x_n\}\) in \(X\) that we would like to write as a disjoint union of \(k\) subsets \[ D = D_1 \sqcup \cdots \sqcup D_k \] In most applications, the data is numerical and therefore we can assume that \(X\) is a subset of a finite-dimensional vector space over \({\mathbb R}\). But if the data is categorical and cannot be easily ranked or ordered, then we really must work with an abstract metric space. Our algorithm works as follows:
Select an initial set of centers \(y^1_1,\ldots, y^1_k\) for our clusters \(D_1,\ldots, D_k\). The selection can be made randomly or specified by hand.
Assuming we have these centers \(y^m_1,\ldots,y^m_k\) at step \(m\), go over every point in our data set \(D\). Postulate that \(x\in D^m_i\) if the distance from \(x\) to \(y^m_i\) is minimal among all such centers. In other words, \(x\in D^m_i\) if and only if \[ d(x,y^m_i) = \min\{\, d(x,y^m_j) \mid j = 1,\ldots, k \,\} \]
Construct new centers \(y^{m+1}_i\) from the cluster subsets \(D^m_i\) we formed in the previous step. If \(X\) is a subset of a vector space then this is simply the average of all vectors in \(D^m_i\): \[ y^{m+1}_i = \frac{1}{|D^m_i|} \sum_{x\in D^m_i} x \]
Repeat steps 2 and 3 until \(y^m_i = y^{m+1}_i\) for every \(i=1,\ldots,k\).
First, I need to write a few utility functions.
;; Return the list (0 1 ... N-1).
(defun range (N) (loop for i from 0 below N collect i))

;; Pull out a single coordinate from every point in a collection of vectors.
(defun select-column (collection column)
  (map 'vector (lambda (x) (aref x column)) collection))
SELECT-COLUMN
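To illustrate these two with a throwaway example (the vectors here are made up, not from any data set):
(range 4)                                   ;; => (0 1 2 3)
(select-column (list #(1 2 3) #(4 5 6)) 1)  ;; => #(2 5)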
I will assume that the data is given as a collection of comma-separated entries, one data point per line. I would like to write a utility function so that I can select the specific columns of data I would like to consider. I will also need the regular expression library cl-ppcre.
(require :cl-ppcre)
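(If cl-ppcre is not already installed on your system, Quicklisp users should be able to pull it in with (ql:quickload "cl-ppcre") instead.)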
The next utility function will take the name of the data file as its argument and load the data. It will return a list of entries.
(defun read-data (name)
  "Read a comma-separated file and return a list of vectors, one per non-empty line."
  (with-open-file (file name)
    (loop for line = (read-line file nil)
          while line
          when (> (length line) 0)
            collect (map 'vector 'read-from-string (ppcre:split #\, line)))))
Let us test the function on a sample data set. The data comes from the UCI data repository.
(setq data (read-data "iris.csv"))
Our algorithm will need two fundamental functions, both of which should be specific to the data at hand. The first is the distance function. Both the distance and cluster-center functions we are going to define must depend on the structure of the data points. For this data the \(L^1\)-distance (also known as the taxicab distance) is going to be our distance function.
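For two numerical vectors \(x\) and \(y\) this is the sum of the coordinate-wise absolute differences \[ d(x,y) = \sum_i |x_i - y_i| \] where, in the code below, the last coordinate is skipped since it holds the class label rather than a number.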
;; L1-distance between two data points.  The last column holds the class
;; label, so the sum runs over the first N-1 coordinates only.
(defun distance (x y)
  (let ((N (length x)))
    (loop for i from 0 below (1- N) sum (abs (- (aref x i) (aref y i))))))
Let us test the function on two points from our data set:
(setq NUM (length data))
(setq point-1 (nth (random NUM) data))
(setq point-2 (nth (random NUM) data))
(format nil "Distance between ~A and ~A is ~A~%" point-1 point-2 (distance point-1 point-2))
Distance between #(5.4 3.9 1.7 0.4 IRIS-SETOSA) and
#(6.8 3.0 5.5 2.1 IRIS-VIRGINICA) is 7.8
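Indeed, \(|5.4-6.8| + |3.9-3.0| + |1.7-5.5| + |0.4-2.1| = 1.4 + 0.9 + 3.8 + 1.7 = 7.8\), with the class labels in the last column ignored.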
The center of a collection of points is the point obtained by averaging each coordinate.
;; Center of a cluster: average each of the first m-1 (numerical) columns.
;; The last slot, which corresponds to the class label, is left as NIL.
(defun cluster-center (collection)
  (let* ((N (length collection))          ; number of points
         (m (length (car collection)))    ; number of columns per point
         (res (make-array m :initial-element nil)))
    (map-into res
              (lambda (i) (/ (reduce '+ (select-column collection i)) N))
              (range (1- m)))
    res))
Let us test this function on the whole dataset to find the center.
(setq data-center (cluster-center data))
#(5.8433347 3.054 3.7586665 1.1986669 NIL)
Now, the following function which-cluster decides which cluster a given data point belongs to. It takes two arguments: a data point x and a sequence centers of \(k\) data points designated as cluster centers.
(defun which-cluster (x centers)
  "Return the index of the center in CENTERS that is closest to the point X."
  (let* ((N (length centers))
         (res (mapcar 'cons
                      (map 'list (lambda (u) (distance x u)) centers)
                      (range N))))
    (cdar (sort res '< :key 'car))))
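As a quick sanity check (assuming data and data-center from above are still bound), with a single candidate center the answer can only be 0:
(which-cluster (first data) (vector data-center))
0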
The next function k-means-iterate takes two arguments: a training data set training-data and a set of data points of size \(k\), one for each cluster we would like to form. It returns the next set of centers for the clusters.
(defun k-means-iterate (training-data centers)
  "Assign each training point to its nearest center, then return the new centers."
  ;; The buckets must start out as empty lists for PUSH to work reliably.
  (let ((clusters (make-array (length centers) :initial-element nil)))
    (dolist (y training-data)
      (push y (aref clusters (which-cluster y centers))))
    (map 'vector 'cluster-center clusters)))
K-MEANS-ITERATE
One has to be careful here. It is possible that k-means-iterate produces fewer than \(k\) usable centers even if it is given exactly \(k\): if one of the clusters comes out empty, its center is degenerate, and the iteration will fail ungracefully on the next pass. You are warned. I might add extra code to deal with this boundary case, but I would rather re-run the program until I get exactly \(k\) proper centers.
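If you do want to guard against this, here is a minimal sketch of a helper (my own addition, not used in the rest of this post) that keeps re-drawing random initial centers until one assignment pass of k-means-iterate produces \(k\) non-empty clusters:
;; Sketch only: re-draw initial centers until every cluster is non-empty
;; after a single assignment pass.
(defun safe-initial-centers (training-data k &optional (max-tries 100))
  (let* ((copy (coerce training-data 'vector))
         (n (length copy)))
    (loop repeat max-tries
          for centers = (coerce (loop repeat k collect (aref copy (random n)))
                                'vector)
          when (every (lambda (c) (plusp (length c)))
                      (k-means-iterate training-data centers))
            return centers
          finally (error "Could not find ~A viable initial centers." k))))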
And finally we write the training function. This will take a training data set and iterate k-means-iterate until the center points for our clusters stabilize, that is, until the total distance between successive sets of centers drops below a given error bound.
(defun k-means-train (training-data centers &optional (error 0.003))
  "Iterate K-MEANS-ITERATE until the centers move less than ERROR in total."
  (let ((N (length centers)))
    (do* ((v centers u)
          (u (k-means-iterate training-data v) (k-means-iterate training-data v)))
         ((< (loop for i from 0 below N sum (distance (aref u i) (aref v i)))
             error)
          u))))
K-MEANS-TRAIN
I will run the whole data set through the algorithm. I will use three centers; that is, I would like to split the data set into 3 clusters.
(setq centers
      (let* ((copy (coerce data 'vector))
             (NUM (length copy)))
        ;; Pick 3 random data points as the initial cluster centers.
        (map 'vector (lambda (x) (aref copy (random NUM))) #(1 2 3))))
#(#(6.4 2.9 4.3 1.3 IRIS-VERSICOLOR) #(6.0 2.2 5.0 1.5 IRIS-VIRGINICA)
#(4.6 3.2 1.4 0.2 IRIS-SETOSA))
Now, I will train the algorithm and it will give us the centers for the clusters. Then, using these cluster centers, I can figure out which cluster any given point belongs to using the which-cluster function. I am going to use the error value 0.004, which is completely arbitrary. You are free not to pass an error bound; in that case the default value 0.003 will be used.
(setf class-centers (k-means-train data centers 0.004))
#(#(5.006 3.418 1.4640001 0.24399994 NIL)
#(6.87027 3.0864863 5.7459464 2.0891888 NIL)
#(5.9047627 2.7460315 4.4126983 1.4333336 NIL))
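For example, to classify a single randomly chosen point against these centers (the index you get will of course vary from run to run):
(which-cluster (nth (random (length data)) data) class-centers)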
Let us see how successful the algorithm was:
(setq N (length data-center))
(setq results (loop for x in data collect (list (aref x (1- N))
(which-cluster x class-centers))))
(setq report nil)
(dolist (x results)
(if (assoc x report :test #'equal)
(incf (cdr (assoc x report :test #'equal)))
(push (cons x 1) report)))
report
(((IRIS-VIRGINICA 2) . 15) ((IRIS-VIRGINICA 0) . 35)
((IRIS-VERSICOLOR 0) . 2) ((IRIS-VERSICOLOR 2) . 48)
((IRIS-SETOSA 1) . 50))
Consider the following 3x3 matrix where we display the way our clusters are formed.
| | Cluster 0 | Cluster 1 | Cluster 2 |
|---|---|---|---|
| Versicolor | 2 | 0 | 48 |
| Virginica | 35 | 0 | 15 |
| Setosa | 0 | 50 | 0 |
The results indicate that if the return value from the which-cluster function for a data point \(x\) is 1, using the centers provided by the k-means-train function, then the point \(x\) is very likely to be an Iris Setosa. If the function returns 2 for \(x\) then \(x\) is very likely to be an Iris Versicolor or an Iris Virginica; moreover, the odds that it is an Iris Versicolor are approximately 3 times the odds of \(x\) being an Iris Virginica. If the function returns 0, then \(x\) is very likely to be an Iris Virginica.
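If you would rather read this off programmatically than from the table, here is a small helper I am adding purely for illustration (it is not used anywhere else in this post); it groups the entries of report by cluster index:
;; Illustration only: group the ((class cluster) . count) entries of REPORT
;; by cluster index.
(defun cluster-breakdown (report k)
  (loop for cluster from 0 below k
        collect (cons cluster
                      (loop for ((class cl) . count) in report
                            when (= cl cluster)
                              collect (cons class count)))))

(cluster-breakdown report 3)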
Ok. We are done with this data set. I am going to use another data set from the UCI Machine Learning Repository. I am going to be lazy and, instead of writing new distance and cluster-center functions specific to this new data set, I am going to modify the new data so that the last column is the class ascribed to the data point on each row.
(setq data (read-data "wine.csv"))
Ok. Let us check the distance function
(setq NUM (length data))
(setq point-1 (nth (random NUM) data))
(setq point-2 (nth (random NUM) data))
(format nil "Distance between ~A and ~A is ~A~%" point-1 point-2 (distance point-1 point-2))
Distance between #(12.77 2.39 2.28 19.5 86 1.39 0.51 0.48 0.64
9.899999 0.57 1.63 470 3) and
#(12.08 1.83 2.32 18.5 81 1.6 1.5 0.52 1.64 2.4 1.08 2.27 480 2) is 28.179998
and test the cluster-center function on the whole data set
(setf data-center (cluster-center data))
#(13.000614 2.336348 2.3665185 19.494946 8877/89 2.2951121 2.02927
0.36185396 1.5908992 5.0580897 0.95744956 2.6116843 132947/178 NIL)
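The exact ratios such as 8877/89 and 132947/178 show up because those columns are integer-valued, and Common Lisp keeps the average of integers as an exact rational rather than silently converting it to a float.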
As before, I need 3 clusters. I will again recycle the code above to pick the 3 initial cluster centers randomly:
(setq centers
(let* ((copy (coerce data 'vector))
(NUM (length copy)))
(map 'vector (lambda (x) (aref copy (random NUM))) #(1 2 3))))
#(#(13.03 0.9 1.71 16 86 1.95 2.03 0.24 1.46 4.6 1.19 2.48 392 2)
#(12.37 0.94 1.36 10.6 88 1.98 0.57 0.28 0.42 1.95 1.05 1.82 520 2)
#(12.37 1.63 2.3 24.5 88 2.22 2.45 0.4 1.9 2.12 0.89 2.78 342 2))
Now, let us run k-means-train
(setq class-centers (k-means-train data centers 0.004))
#(#(12.51191 2.4873528 2.2838235 20.776472 6271/68 2.0670583 1.7754412
0.38808835 1.4613234 4.074706 0.9419118 2.4957347 7757/17 NIL)
#(13.804467 1.8834045 2.42617 17.023403 4959/47 2.867234 3.014256
0.28531918 1.9104254 5.702553 1.0782979 3.1140423 56172/47 NIL)
#(12.9284115 2.5112698 2.4112694 19.955551 932/9 2.114444 1.5684129
0.39063492 1.4923812 5.6387305 0.8840635 2.3620634 5083/7 NIL))
Let us see how successful the algorithm was:
(setq N (length data-center))
(setq results (loop for x in data collect (list (aref x (1- N)) (which-cluster x class-centers))))
(setq report nil)
(dolist (x results)
(if (assoc x report :test #'equal)
(incf (cdr (assoc x report :test #'equal)))
(push (cons x 1) report)))
report
(((3 1) . 18) ((3 0) . 30) ((2 2) . 1) ((2 0) . 20) ((2 1) . 50)
((1 0) . 13) ((1 2) . 46))
Consider the following 3x3 matrix where we display the way our clusters are formed.
| | Cluster 0 | Cluster 1 | Cluster 2 |
|---|---|---|---|
| Class 1 | 13 | 0 | 46 |
| Class 2 | 20 | 50 | 1 |
| Class 3 | 30 | 18 | 0 |
The table tells us the following. If the function which-cluster returns 2 for a data point \(x\) using the centers we obtained from k-means-train, then it is likely that the data point \(x\) comes from a Class 1 wine. If the return value is 1 then \(x\) most likely comes from a Class 2 or Class 3 wine; moreover, the odds of \(x\) being a Class 2 wine as opposed to a Class 3 wine are approximately 5 to 2. If the function which-cluster returns 0, the return value is not going to be of much use.