The Kitchen Sink and Other Oddities

Atabey Kaygun

An implementation of the fixed-radius near neighbor clustering algorithm in Lisp

Description of the problem and the algorithm

The setup is the same as in my k-means clustering post. I have a metric space \((X,d)\) and a finite collection of points \(D = \{x_1,\ldots,x_N\}\) which I would like to write as a union of disjoint subsets \[ D = D_1 \sqcup \cdots \sqcup D_k \] Unlike in the k-means algorithm, the number \(k\) is not fixed at the beginning. Instead, we feed the algorithm a number which I will call the cluster diameter.

The diameter of a set of points in a metric space is defined by \[ \text{diam}(C) = \sup_{x,y\in C} d(x,y) \] If \(C\) happens to be finite then this is just the maximum of all possible distances one can write using pairs of elements from the set \(C\). I will need an auxiliary function I will call proximity which is defined as \[ \text{proximity}(x,C) = \sup_{y\in C} d(x,y) \]
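
When \(C\) is finite both suprema are plain maxima, so they can be computed by brute force. Here is a sketch of the diameter, assuming a two-argument distance function like the one defined later in this post:

;; Brute-force diameter of a finite cluster: the largest pairwise
;; distance. Assumes a two-argument DISTANCE function is defined.
(defun diameter (cluster)
  (reduce 'max
          (loop for (x . rest) on cluster
                append (mapcar (lambda (y) (distance x y)) rest))
          :initial-value 0))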

The fixed-radius near neighbor clustering algorithm forms clusters of points whose diameters are bounded by a given radius: a point joins an existing cluster only when its proximity to that cluster is smaller than the radius, so no cluster's diameter can exceed it.

So, here is our algorithm in pseudocode:

Function frnn(D, radius)

1. Pick a point at random, call it p 
2. Let the initial set of clusters be the set
   which contains only the one point cluster {p}.
   That is, let Clusters = { {p} }
3. For every x in D other than p do
   a.  For every C in Clusters calculate the proximity of x to C
   b.  If the proximity of x to C is not smaller than the radius
       for every C in Clusters, then add a new one point cluster
       {x} to Clusters.
   c.  Otherwise, find the cluster C for which proximity(x,C) is
       minimal and add x to that cluster.

return Clusters

An implementation in Lisp

Here is my implementation of the algorithm in Lisp. I will assume that the data is given as a list of data points, and that a distance function called distance is already defined for the data at hand.

;; Return the list of indices (0 1 ... N-1).
(defun range (N) (loop for i from 0 below N collect i))

;; Proximity of a point x to a cluster: the largest distance from x
;; to any member of the cluster.
(defun proximity (x cluster)
     (reduce 'max (mapcar (lambda (y) (distance x y)) cluster)))

(defun frnn (data radius)
   (let* ((len (length data))
          (p (car data))          ; initial point: the first element (see the note below)
          (copy (cdr data))
          (clusters (list (list p))))
      (dolist (x copy)
         ;; Pair each cluster's proximity to x with that cluster's index;
         ;; MAPCAR stops at the shorter list, so (range len) is long enough.
         (let* ((res (mapcar (lambda (C i) (cons (proximity x C) i))
                             clusters (range len)))
                ;; After sorting, the pair with the smallest proximity comes first.
                (which (car (sort res (lambda (u v) (< (car u) (car v)))))))
            (if (< (car which) radius)
               (push x (nth (cdr which) clusters))  ; join the closest cluster
               (push (list x) clusters))))          ; otherwise start a new one
      clusters))

The implementation above cuts some corners: instead of choosing the initial point p at random, I took it to be the first element of the dataset. I should have picked an element at random and then obtained the copy list from data by deleting the chosen element p.
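
A minimal sketch of that random start, reusing frnn unchanged (the helper name frnn-random-start is my own), could look like this:

;; Pick p at random, move it to the front, and delete only that
;; occurrence from the rest; EQ works as the test because we compare
;; the very same object we just picked.
(defun frnn-random-start (data radius)
  (let ((p (nth (random (length data)) data)))
    (frnn (cons p (remove p data :count 1 :test #'eq)) radius)))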

Let us test this algorithm on the dataset we used to test our k-means clustering algorithm. I will copy the code I used there below together with the utility functions I used to load the data.

(require :cl-ppcre)
NIL


(defun read-data (name)
  (with-open-file (file name)
    (let (line)
      (loop while (setf line (read-line file nil))
            when (> (length line) 0)
            collect (map 'vector 'read-from-string (ppcre:split #\, line))))))

(setq data (read-data "iris.csv"))

(defun distance (x y)
  ;; L1 distance over the numeric coordinates only: mapping against a
  ;; zero vector one element shorter than x makes MAP stop before the
  ;; class label stored in the last slot.
  (reduce '+ (map 'list (lambda (i j k) (abs (- i j k))) x y
                  (make-array (1- (length x)) :initial-element 0))))
DISTANCE
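
To see the map trick in action, here is a quick check on two hand-written vectors in the same format as the iris rows; up to floating point rounding the result should be 0.2 + 0.5 + 0 + 0 = 0.7, with the class label in the last slot ignored:

(distance #(5.1 3.5 1.4 0.2 IRIS-SETOSA)
          #(4.9 3.0 1.4 0.2 IRIS-SETOSA))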

Let us choose a point at random and a small sample at random, and calculate the proximity measure to test the code:

(setq NUM (length data))
(setq point (nth (random NUM) data))
(setq sample (loop for i from 1 to 63 collect (nth (random NUM) data)))
(proximity point sample)
7.8999996

Now, since we do not know in advance how many clusters we will get for a given radius, we must try a few values and see how many clusters emerge, together with the distribution of classes within each cluster.
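
One could first scan a handful of radii to see how the cluster count responds; the radii in this sketch are arbitrary choices of mine. Below, I look at two specific radii in detail.

(dolist (r '(3.0 5.5 7.0 8.9))
  (format t "radius ~a -> ~a clusters~%" r (length (frnn data r))))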

(setq clusters (frnn data 5.5))
(length clusters)
3

Let us see the composition of these clusters

(setq report nil)
(setq N (1- (length (car data))))   ; index of the class label in each row

;; Count, for every (class . cluster-index) pair, how many samples of
;; that class landed in that cluster.
(dolist (i (range (length clusters)))
   (dolist (x (nth i clusters))
      (if (assoc (cons (aref x N) i) report :test #'equal)
         (incf (cdr (assoc (cons (aref x N) i) report :test #'equal)))
         (push (cons (cons (aref x N) i) 1) report))))
report
(((IRIS-SETOSA . 2) . 50) ((IRIS-VERSICOLOR . 1) . 50)
 ((IRIS-VIRGINICA . 1) . 13) ((IRIS-VIRGINICA . 0) . 37))
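
Each entry of the report has the shape ((class . cluster-index) . count). A quick way to print it more readably, with a small helper of my own, is:

(dolist (entry (reverse report))
  (format t "cluster ~a: ~a samples of ~a~%"
          (cdar entry) (cdr entry) (caar entry)))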

The results indicate that all of the IRIS-VERSICOLOR samples are clustered together in cluster 1, and similarly all of the IRIS-SETOSA samples are clustered together in cluster 2. Unfortunately, the IRIS-VIRGINICA samples are split between clusters 0 and 1, and if a sample falls within cluster 1 the odds that it is an IRIS-VERSICOLOR rather than an IRIS-VIRGINICA are roughly 4 to 1.

Let us try another radius and see the distribution in that case:

(setq clusters (frnn data 8.9))

(setq report nil)
(setq N (1- (length (car data))))

(dolist (i (range (length clusters)))
   (dolist (x (nth i clusters))
      (if (assoc (cons (aref x N) i) report :test #'equal)
         (incf (cdr (assoc (cons (aref x N) i) report :test #'equal)))
         (push (cons (cons (aref x N) i) 1) report))))
report
(((IRIS-SETOSA . 1) . 50) ((IRIS-VERSICOLOR . 1) . 50)
 ((IRIS-VIRGINICA . 0) . 50))

We now have only two clusters: the IRIS-SETOSA and IRIS-VERSICOLOR samples are clumped together in cluster 1, while the IRIS-VIRGINICA samples are separated out into cluster 0, where we can distinguish them from the rest of the samples.