The setup is similar to other classification problems I implemented before: I have a finite set of points \(D := \{ x_1,\ldots,x_N \}\) from a metric space \((X,d)\) which I would like to write as a disjoint union of \(m\) subsets \[ D = D_1 \sqcup \cdots \sqcup D_m \] This time, I also have a training set \(T = \{ t_1,\ldots,t_L \}\) for which I have a classification scheme \[ c\colon T\to \{1,\ldots,m\} \] such that \(t_i\in D_{c(t_i)}\) for each \(i=1,\ldots,L\). Our problem is to extend \(c\) to a function \(c\colon X\to \{1,\ldots,m\}\).
Given a point \(x\in X\), we calculate its distance to each point in the training set, and then rank the training points from closest to farthest. The majority of the class labels among the first \(k\) determines which label the point \(x\) is going to get. In short, we ask the training set which label \(x\) should get, and the highest vote among the closest \(k\) points wins.
Here is the pseudo-code for the algorithm:

Function knn
  Input:  A finite set D of points to be classified
          A finite set T of training points
          A function c: T -> {1,...,m}
          A natural number k
  Output: A function r: D -> {1,...,m}
Begin
  Foreach x in D do
    Let U <- {}
    Foreach t in T add the pair ( d(x,t) , c(t) ) to U
    Sort the pairs in U by their first components
    Count the class labels among the first k elements of U
    Let r(x) be the class label with the highest number of occurrences
  End Foreach
  Return r
End
Let me start with some utility functions. First, some code to load the data:
(defun read-data (file)
  ;; read a comma-separated file into a vector of vectors
  (with-open-file (stream file)
    (let ((result (make-array 0 :fill-pointer t :adjustable t)))
      (do ((line (read-line stream nil) (read-line stream nil)))
          ((null line) result)
        (vector-push-extend
          (map 'vector 'read-from-string (ppcre:split #\, line))
          result)))))
READ-DATA
I will start by using the Wine dataset from UCI. I will load in the data and take a random sample of size 45 as the training set.
(defvar data (read-data "wine.csv"))
(defvar N (length data))
(defvar train (map 'vector (lambda (x) (aref data (random N))) (make-array 45)))
TRAIN
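One caveat: picking training points with (random N) samples with replacement, so the training set may contain duplicates. If that is undesirable, here is a sketch of a hypothetical helper (not part of the original pipeline) that samples without replacement using a Fisher-Yates shuffle:

```lisp
(defun sample-without-replacement (data n)
  ;; hypothetical helper: shuffle a copy of the data with
  ;; Fisher-Yates, then take the first n entries
  (let ((shuffled (copy-seq data)))
    (loop for i from (1- (length shuffled)) downto 1
          do (rotatef (aref shuffled i) (aref shuffled (random (1+ i)))))
    (subseq shuffled 0 n)))
```

With this, (sample-without-replacement data 45) would yield 45 distinct rows (assuming the rows of the dataset are themselves distinct).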
We make the following assumption about the data: the data points are numerical vectors whose last entries are the class labels. Using this assumption we define the distance function as follows:
(defun distance (x y)
  ;; the extra third sequence truncates the map so that
  ;; the class label in the last slot is left out
  (sqrt (reduce '+ (map 'vector
                        (lambda (i j k) (declare (ignore k)) (expt (- i j) 2))
                        x y (make-array (1- (length x)))))))
DISTANCE
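A quick sanity check with made-up vectors: the trailing entries play the role of class labels and are dropped by the truncation trick, so only the leading coordinates enter the computation.

```lisp
;; the last entries (99 and 42) stand in for class labels
;; and are excluded from the distance
(distance #(1 2 99) #(4 6 42)) ; => 5.0
```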
and we read the class label as follows:
(defun read-label (x) (aref x (1- (length x))))
READ-LABEL
OK, now let me test a part of the function: given a point, calculate its distances to the points of the training set and sort the results from smallest to largest.
(defun process (x train)
  ;; return the class labels of the training points,
  ;; ordered by increasing distance to x
  (map 'vector 'car
       (sort (map 'list (lambda (y) (cons (read-label y) (distance x y))) train)
             (lambda (u v) (< (cdr u) (cdr v))))))
PROCESS
And test it on a sample point:
(let ((x (aref data (random N))))
(cons (read-label x) (process x train)))
(1
. #(1 1 1 1 1 1 1 2 1 2 1 1 1 1 3 3 3 3 3 1 3 2 1 3 2 2 2 2 3 2 3 2 3 2 2
3 2 2 1 2 2 2 2 2 2))
What we need now is a function which will find the most frequent class label among the first \(k\) entries. This is our classify function, which determines the class label of a point:
(defun classify (x train k)
  (let (result
        (temp (process x train)))
    ;; tally the labels of the k nearest training points
    (dotimes (i k)
      (if (assoc (aref temp i) result :test 'equal)
          (incf (cdr (assoc (aref temp i) result :test 'equal)))
          (push (cons (aref temp i) 1) result)))
    ;; return the label with the highest count
    (caar (sort result (lambda (i j) (> (cdr i) (cdr j)))))))
CLASSIFY
and we test it
(let ((x (aref data (random N))))
(cons (read-label x) (classify x train 3)))
(3 . 1)
which is a miss for this particular point: the true label is 3 but the computed label is 1.
Now, we apply our function to the whole dataset to assess how successful it is. For that I need to compare the class labels given with the data against the class labels we calculate.
(defun make-table (sent)
  (let (result)
    (dolist (x sent)
      (if (assoc x result :test 'equal)
          (incf (cdr (assoc x result :test 'equal)))
          (push (cons x 1) result)))
    result))
MAKE-TABLE
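Besides the \(\chi^2\)-test I use below, one might also want the raw accuracy: the fraction of points whose computed label matches the given label. Here is a small helper of my own (not part of the original pipeline) that works on the same list of (old . new) pairs that make-table consumes:

```lisp
(defun accuracy (pairs)
  ;; pairs is a list of (old-label . new-label) conses;
  ;; count the matches and divide by the total
  (/ (count-if (lambda (p) (equal (car p) (cdr p))) pairs)
     (length pairs)))
```

This is just the sum of the diagonal of the confusion table divided by the total; for instance, a table with diagonal entries 47, 51 and 21 over 178 points gives \((47+51+21)/178\approx 0.67\).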
Here is a full run with \(k=4\):
(make-table (map 'list (lambda (x) (cons (read-label x) (classify x train 4))) data))
(((3 . 1) . 5) ((3 . 2) . 22) ((3 . 3) . 21) ((2 . 1) . 3) ((2 . 3) . 17)
((2 . 2) . 51) ((1 . 2) . 6) ((1 . 3) . 6) ((1 . 1) . 47))
Presented as a table we get

| Old \ New | 1  | 2  | 3  |
|-----------|----|----|----|
| 1         | 47 | 6  | 6  |
| 2         | 3  | 51 | 17 |
| 3         | 5  | 22 | 21 |
Now, I will repeat the analysis with \(k=8\):
(make-table (map 'list (lambda (x) (cons (read-label x) (classify x train 8))) data))
(((3 . 3) . 20) ((3 . 2) . 28) ((2 . 1) . 2) ((2 . 3) . 11) ((2 . 2) . 58)
((1 . 3) . 13) ((1 . 1) . 46))
Presented as a table we get

| Old \ New | 1  | 2  | 3  |
|-----------|----|----|----|
| 1         | 46 | 0  | 13 |
| 2         | 2  | 58 | 11 |
| 3         | 0  | 28 | 20 |
If we apply the \(\chi^2\)-test to these tables (higher association between the rows and columns is better) we see
> chisq.test(matrix(c(47,6,6,3,51,17,5,22,21),nrow=3,ncol=3))
Pearson's Chi-squared test
data: matrix(c(47, 6, 6, 3, 51, 17, 5, 22, 21), nrow = 3, ncol = 3)
X-squared = 108.0064, df = 4, p-value < 2.2e-16
> chisq.test(matrix(c(46,0,13,2,58,11,0,28,20),nrow=3,ncol=3))
Pearson's Chi-squared test
data: matrix(c(46, 0, 13, 2, 58, 11, 0, 28, 20), nrow = 3, ncol = 3)
X-squared = 139.2728, df = 4, p-value < 2.2e-16
The second result shows a stronger association. Of course, we can also change the number of training points.
(setf train (map 'vector (lambda (x) (aref data (random N))) (make-array 70)))
(make-table (map 'list (lambda (x) (cons (read-label x) (classify x train 4))) data))
(((3 . 1) . 8) ((3 . 2) . 25) ((3 . 3) . 15) ((2 . 1) . 7) ((2 . 3) . 21)
((2 . 2) . 43) ((1 . 3) . 1) ((1 . 1) . 58))
Presented as a table we get

| Old \ New | 1  | 2  | 3  |
|-----------|----|----|----|
| 1         | 58 | 0  | 1  |
| 2         | 7  | 43 | 21 |
| 3         | 8  | 25 | 15 |

and the corresponding \(\chi^2\)-test is

> chisq.test(matrix(c(58,0,1,7,43,21,8,25,15),nrow=3,ncol=3))

I also ran the same analysis on the Iris dataset from UCI. One run gave the following table:

| Old \ New  | Virginica | Versicolor | Setosa |
|------------|-----------|------------|--------|
| Virginica  | 44        | 6          | 0      |
| Versicolor | 1         | 49         | 0      |
| Setosa     | 0         | 0          | 50     |

Let me run another test with a different \(k\) on the same data set:

(make-table (map 'list (lambda (x) (cons (read-label x) (classify x train 5))) data))
(((IRIS-VIRGINICA . IRIS-VERSICOLOR) . 15) ((IRIS-VIRGINICA . IRIS-VIRGINICA) . 35)
 ((IRIS-VERSICOLOR . IRIS-VERSICOLOR) . 50) ((IRIS-SETOSA . IRIS-SETOSA) . 50))

Again presented in table format we see

| Old \ New  | Virginica | Versicolor | Setosa |
|------------|-----------|------------|--------|
| Virginica  | 35        | 15         | 0      |
| Versicolor | 0         | 50         | 0      |
| Setosa     | 0         | 0          | 50     |
which is on par with the result above.