The setup is the same as in my k-means clustering post. I have a metric space \((X,d)\) and a finite collection of points \(D = \{x_1,\ldots,x_N\}\) which I would like to write as a union of disjoint subsets \[ D = D_1 \sqcup \cdots \sqcup D_k \] Unlike the k-means algorithm, the number \(k\) is not fixed at the beginning; instead, we feed the algorithm a number which I will call the cluster diameter.
The diameter of a set of points in a metric space is defined by \[
\text{diam}(C) = \sup_{x,y\in C} d(x,y) \] If \(C\) happens to be finite then this is just the maximum of all possible distances one can write using pairs of elements from the set \(C\).
I will need an auxiliary function, which I will call proximity, defined as \[
\text{proximity}(x,C) = \sup_{y\in C} d(x,y) \]
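When \(C\) is finite both suprema are just maxima over finitely many pairs. As a small illustration (a sketch of mine, assuming a distance function is already defined for the points at hand), the diameter can be computed by scanning all pairs:

(defun diam (C)
  ;; Largest pairwise distance within the finite set C.  Returns 0
  ;; for empty and one-point sets.  Assumes DISTANCE is defined.
  (let ((m 0))
    (loop for (x . rest) on C
          do (loop for y in rest
                   do (setf m (max m (distance x y)))))
    m))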
The fixed-radius near neighbor clustering algorithm will form clusters of points whose diameters are less than or equal to a given radius: a point x is added to a cluster C only when proximity(x,C) is within the radius, that is, when x is within the radius of every point already in C, so the diameter bound is preserved as the cluster grows.
So, here is our algorithm in pseudocode:
Function frnn(D, radius)
1. Pick a point at random, call it p
2. Let the initial set of clusters be the set
which contains only the one point cluster {p}.
That is, let Clusters = { {p} }
3. For every x in D other than p do
a. For every C in Clusters calculate the proximity of x to C
b. If the proximity of x to C is greater than the radius for every C in Clusters,
then add a new one point cluster {x} to Clusters.
c. Otherwise find the cluster C such that proximity(x,C) is
minimal and add x to that cluster.
return Clusters
Here is my implementation of the algorithm in Lisp. I will assume that the data is given as a list of data points, and that a distance function called distance is already defined for the data I have.
(defun range (N)
  "Return the list (0 1 ... N-1)."
  (loop for i from 0 below N collect i))

(defun proximity (x cluster)
  "Largest distance from the point X to any point of CLUSTER."
  (reduce 'max (mapcar (lambda (y) (distance x y)) cluster)))

(defun frnn (data radius)
  (let* ((len (length data))
         (p (car data))            ; seed point (the first element, see below)
         (copy (cdr data))         ; remaining points to be assigned
         (clusters (list (list p))))
    (dolist (x copy)
      ;; Pair each cluster's proximity to X with the cluster's index,
      ;; then pick the pair with the smallest proximity.
      (let* ((res (mapcar (lambda (C i) (cons (proximity x C) i))
                          clusters (range len)))
             (which (if (> (length res) 1)
                        (car (sort res (lambda (u v) (< (car u) (car v)))))
                        (car res))))
        (if (< (car which) radius)
            ;; X is within the radius of every point of the nearest
            ;; cluster: add it to that cluster.
            (push x (nth (cdr which) clusters))
            ;; Otherwise X starts a new one point cluster.
            (push (list x) clusters))))
    clusters))
The implementation I gave above cuts some corners: instead of assigning the initial point p randomly, I took it to be the first element of the dataset. I should have picked an element at random, and then obtained the copy list from data by deleting the chosen element p.
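Here is a minimal sketch of that variant (the function name frnn-random-start is mine); it moves a randomly chosen point to the front of the list and then reuses frnn as written:

(defun frnn-random-start (data radius)
  ;; Pick the seed point at random, delete that occurrence from the
  ;; data, and put it back at the front so that FRNN uses it as the
  ;; initial one point cluster.
  (let ((p (nth (random (length data)) data)))
    (frnn (cons p (remove p data :count 1)) radius)))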
Let us test this algorithm on the dataset we used to test our k-means clustering algorithm. I will copy the code I used there below together with the utility functions I used to load the data.
(require :cl-ppcre)
NIL
(defun read-data (name)
  ;; Read comma separated values from the file NAME into a list of
  ;; vectors, skipping blank lines.  Each field is parsed with
  ;; READ-FROM-STRING, so the class label becomes a symbol.
  (with-open-file (file name)
    (loop for line = (read-line file nil)
          while line
          when (> (length line) 0)
            collect (map 'vector 'read-from-string (ppcre:split #\, line)))))
(setq data (read-data "iris.csv"))
(defun distance (x y)
  ;; L1 (Manhattan) distance over the numeric fields.  The vector of
  ;; zeros has one fewer entry than X, so MAP stops before the last
  ;; field, which holds the class label.
  (reduce '+ (map 'list (lambda (i j k) (abs (- i j k)))
                  x y
                  (make-array (1- (length x)) :initial-element 0))))
DISTANCE
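As a quick sanity check (the vectors below are made up by me, not taken from the dataset), the class label in the last slot does not contribute:

(distance #(1 2 3 x) #(4 6 3 y))  ; |1-4| + |2-6| + |3-3| = 7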
Let us choose a point and a small sample at random, and calculate the proximity measure to test the code:
(setq NUM (length data))
(setq point (nth (random NUM) data))
(setq sample (loop for i from 1 to 63 collect (nth (random NUM) data)))
(proximity point sample)
7.8999996
Now, since we don’t know how many clusters we are going to get for a given radius, we must try a few values and see how many clusters we get, together with the distribution of classes within each cluster.
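One way to do this systematically (a sketch of mine, not part of the original experiment) is to scan a range of radii and record how many clusters each produces:

;; Cluster counts for a range of radii; the exact numbers depend
;; on the data and on the choice of the seed point.
(loop for r from 3.0 to 10.0 by 0.5
      collect (cons r (length (frnn data r))))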
(setq clusters (frnn data 5.5))
(length clusters)
3
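Before breaking these down by class we can glance at the raw cluster sizes:

(mapcar 'length clusters)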
Let us see the composition of these clusters:
(setq report nil)
(setq N (1- (length (car data))))  ; index of the class label field
;; Count, for every (class . cluster-index) pair, how many samples
;; of that class fall into that cluster.
(dolist (i (range (length clusters)))
  (dolist (x (nth i clusters))
    (if (assoc (cons (aref x N) i) report :test #'equal)
        (incf (cdr (assoc (cons (aref x N) i) report :test #'equal)))
        (push (cons (cons (aref x N) i) 1) report))))
report
(((IRIS-SETOSA . 2) . 50) ((IRIS-VERSICOLOR . 1) . 50)
((IRIS-VIRGINICA . 1) . 13) ((IRIS-VIRGINICA . 0) . 37))
The results indicate that all of the samples from IRIS-VERSICOLOR are clustered together in cluster 1, and all of the samples from IRIS-SETOSA are clustered together in cluster 2. Unfortunately, the IRIS-VIRGINICA samples are split between clusters 0 and 1: cluster 0 contains only IRIS-VIRGINICA samples, and if a sample falls within cluster 1 the odds that it is an IRIS-VERSICOLOR rather than an IRIS-VIRGINICA are roughly 4 to 1 (50 against 13).
Let us try another radius and see the distribution in that case:
(setq clusters (frnn data 8.9))
(setq report nil)
(setq N (1- (length (car data))))
(dolist (i (range (length clusters)))
  (dolist (x (nth i clusters))
    (if (assoc (cons (aref x N) i) report :test #'equal)
        (incf (cdr (assoc (cons (aref x N) i) report :test #'equal)))
        (push (cons (cons (aref x N) i) 1) report))))
report
(((IRIS-SETOSA . 1) . 50) ((IRIS-VERSICOLOR . 1) . 50)
((IRIS-VIRGINICA . 0) . 50))
We now have only two clusters: the IRIS-SETOSA and IRIS-VERSICOLOR samples are clumped together in cluster 1, while the IRIS-VIRGINICA samples form a cluster of their own and can be distinguished from the rest.