The Kitchen Sink and Other Oddities

Atabey Kaygun

k-Nearest Neighbor Classification Algorithm Implemented in Lisp

Description of the problem

Like the other classification problems I implemented before, the setup is as follows: I have a finite set of points \(D := \{ x_1,\ldots,x_N \}\) from a metric space \((X,d)\) which I would like to write as a disjoint union of \(m\) subsets \[ D = D_1 \sqcup \cdots \sqcup D_m \] This time, I have a training set of examples \(T = \{ t_1,\ldots,t_L \}\) for which I have a classification scheme \[ c\colon T\to \{1,\ldots,m\} \] such that \(t_i\in D_{c(t_i)}\) for each \(i=1,\ldots,L\). Our problem is to extend \(c\) to a function \(c\colon X\to \{1,\ldots,m\}\).
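To make this concrete with the data used below: \(X\) is \(\mathbb{R}^{13}\) with the Euclidean metric, the points \(x_i\) are wine samples described by 13 numerical measurements, the classes \(D_1, D_2, D_3\) are the three cultivars (so \(m=3\)), and \(T\) is a small random sample of wines whose cultivar is already recorded.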

The algorithm

Given a point \(x\in X\), we calculate its distance to each point in the training set. Then we rank the training points from closest to farthest. The majority of the class labels among the first \(k\) points determines which label the point \(x\) is going to get. In short, we ask the training set which label \(x\) should get, and the highest vote among the closest \(k\) points wins. For example, if \(k=3\) and the three nearest training points carry the labels 2, 1 and 2, then \(x\) gets the label 2.

The pseudo-code

Here is the pseudo-code for the algorithm:

Function knn
Input: A finite set D of points to be classified
       A finite set T of points
       A function c: T -> {1,...,m}
       A natural number k
Output: A function r: D -> {1,...,m}
Begin
  Foreach x in D do
    Let U <- {}
    Foreach t in T add the pair ( d(x,t) , c(t) ) to U
    Sort the pairs in U by their first components
    Count the class labels among the first k elements of U
    Let r(x) be the class with the highest number of occurrences
  End Foreach
  Return r
End

An implementation in Lisp

Let me start with some utility functions. First, some code to load the data:

(defun read-data (file)
   ;; Read a CSV file line by line; each line becomes a vector of its
   ;; parsed fields, and the result is an adjustable vector of these.
   (with-open-file (stream file)
       (let ((result (make-array 0 :fill-pointer t :adjustable t)))
            (do ((line (read-line stream nil) (read-line stream nil)))
                ((null line) result)
                (vector-push-extend (map 'vector 'read-from-string (ppcre:split #\, line)) result)))))
READ-DATA
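The only external dependency here is the cl-ppcre library (the ppcre: prefix), which read-data uses to split each line at the commas. If it is not already present, it can be loaded with Quicklisp:

(ql:quickload "cl-ppcre")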

I will start by using the Wine dataset from the UCI Machine Learning Repository. I load the data and take a small random sample of size 45 to serve as the training set:

(defvar data (read-data "wine.csv"))
(defvar N (length data))
;; Draw 45 random points from the data; the argument x is ignored,
;; (make-array 45) merely provides 45 slots to map over.
(defvar train (map 'vector (lambda (x) (aref data (random N))) (make-array 45)))
TRAIN
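One caveat worth noting: since each index is drawn independently, this samples with replacement, so the same point can appear in the training set more than once. If that is undesirable, here is a sketch of sampling without replacement (my helper, not in the original post):

(defun sample-without-replacement (data n)
   ;; Fisher-Yates shuffle a vector of indices, then take the
   ;; first n of them; the resulting points are all distinct.
   (let ((indices (make-array (length data))))
       (dotimes (i (length data)) (setf (aref indices i) i))
       (loop for i from (1- (length data)) downto 1
             do (rotatef (aref indices i) (aref indices (random (1+ i)))))
       (map 'vector (lambda (i) (aref data i)) (subseq indices 0 n))))

With this, (sample-without-replacement data 45) would play the role of train above.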

We make the following assumption about the data: the data points are numerical vectors whose last entry is the class label. Using this assumption, we define the distance function as follows:

(defun distance (x y)
   ;; Euclidean distance over every coordinate except the last:
   ;; mapping over a third vector of length (1- (length x)) makes
   ;; MAP stop one entry early, so the class labels are never used.
   (sqrt (reduce '+ (map 'vector (lambda (i j k) (declare (ignore k)) (expt (- i j) 2))
                         x y (make-array (1- (length x)))))))
DISTANCE
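As a quick sanity check (my example, not from the post): the label entries are ignored, so the distance below is the plain Euclidean distance between \((0,0)\) and \((3,4)\):

(distance #(0 0 1) #(3 4 2))
5.0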

We read the class label off the last entry as follows:

(defun read-label (x) (aref x (1- (length x))))
READ-LABEL

OK. Now let me test part of the algorithm: given a point, calculate its distances to the points in the training set and sort the results from smallest to largest.

(defun process (x train)
   ;; Pair each training point's class label with its distance to x,
   ;; sort the pairs by distance, and return just the labels, from
   ;; nearest to farthest.
   (map 'vector 'car (sort (map 'list (lambda (y) (cons (read-label y) (distance x y))) train)
                           (lambda (u v) (< (cdr u) (cdr v))))))
PROCESS

And test it on a sample point:

(let ((x (aref data (random N))))
   (cons (read-label x) (process x train)))
(1
 . #(1 1 1 1 1 1 1 2 1 2 1 1 1 1 3 3 3 3 3 1 3 2 1 3 2 2 2 2 3 2 3 2 3 2 2
     3 2 2 1 2 2 2 2 2 2))

What we need now is a function which finds the most frequent class label among the first \(k\) entries. This is our classify function, which determines the class label of a point:

(defun classify (x train k)
   ;; Tally the class labels of the k nearest training points and
   ;; return the label with the highest count; ties are broken
   ;; arbitrarily, by whichever label sorts first.
   (let (result
         (temp (process x train)))
       (dotimes (i k)
           (if (assoc (aref temp i) result :test 'equal)
              (incf (cdr (assoc (aref temp i) result :test 'equal)))
              (push (cons (aref temp i) 1) result)))
       (caar (sort result (lambda (i j) (> (cdr i) (cdr j)))))))
CLASSIFY

and we test it

(let ((x (aref data (random N))))
   (cons (read-label x) (classify x train 3)))
(3 . 1)

which misclassifies this particular point.
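As an aside, with classify in hand the knn function from the pseudo-code above is just a map over the points to be classified; a minimal sketch (not spelled out in the original post):

(defun knn (points train k)
   ;; The knn function from the pseudo-code: classify every point of
   ;; the input set against the training set and return the vector
   ;; of computed labels.
   (map 'vector (lambda (x) (classify x train k)) points))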

Now we apply our function to the whole dataset to assess how successful it is. For that I need to compare the class labels given in the data with the class labels we compute. The following function tallies how many times each (given . computed) pair occurs:

(defun make-table (sent)
   ;; Count how many times each element occurs in the list SENT and
   ;; return an alist of (element . count) pairs.
   (let (result)
       (dolist (x sent)
              (if (assoc x result :test 'equal)
                 (incf (cdr (assoc x result :test 'equal)))
                 (push (cons x 1) result)))
       result))
MAKE-TABLE
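As a quick check (my example), make-table tallies the elements of any list:

(make-table '(a b a))
((B . 1) (A . 2))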

Here is a run over the whole dataset with \(k=4\):

(make-table (map 'list (lambda (x) (cons (read-label x) (classify x train 4))) data))
(((3 . 1) . 5) ((3 . 2) . 22) ((3 . 3) . 21) ((2 . 1) . 3) ((2 . 3) . 17)
 ((2 . 2) . 51) ((1 . 2) . 6) ((1 . 3) . 6) ((1 . 1) . 47))

Presented as a table we get

            New
         1    2    3
    1   47    6    6
Old 2    3   51   17
    3    5   22   21
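A convenient one-number summary of such a table is the raw accuracy: the fraction of points sitting on the diagonal. Here is a small helper along those lines (my addition, not part of the original post) which consumes make-table's output directly:

(defun accuracy (table)
   ;; TABLE is an alist of ((old-label . new-label) . count)
   ;; pairs as returned by make-table above.
   (let ((total 0) (correct 0))
       (dolist (entry table (/ correct total))
           (incf total (cdr entry))
           (when (equal (caar entry) (cdar entry))
               (incf correct (cdr entry))))))

For the table above this comes out to \(119/178 \approx 0.67\), and for the \(k=8\) run below to \(124/178 \approx 0.70\).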

Now I will repeat the analysis with \(k=8\):

(make-table (map 'list (lambda (x) (cons (read-label x) (classify x train 8))) data))
(((3 . 3) . 20) ((3 . 2) . 28) ((2 . 1) . 2) ((2 . 3) . 11) ((2 . 2) . 58)
 ((1 . 3) . 13) ((1 . 1) . 46))

Presented as a table we get

            New
         1    2    3
    1   46    0   13
Old 2    2   58   11
    3    0   28   20

Statistical analysis of the results

If we apply the \(\chi^2\)-test to the tables (where a larger statistic means a stronger association between the rows and the columns, i.e. better agreement between the given and the computed labels) we see

 > chisq.test(matrix(c(47,6,6,3,51,17,5,22,21),nrow=3,ncol=3))

         Pearson's Chi-squared test

 data:  matrix(c(47, 6, 6, 3, 51, 17, 5, 22, 21), nrow = 3, ncol = 3)
 X-squared = 108.0064, df = 4, p-value < 2.2e-16

 > chisq.test(matrix(c(46,0,13,2,58,11,0,28,20),nrow=3,ncol=3))

     Pearson's Chi-squared test

 data:  matrix(c(46, 0, 13, 2, 58, 11, 0, 28, 20), nrow = 3, ncol = 3)
 X-squared = 139.2728, df = 4, p-value < 2.2e-16
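For reference, the statistic being compared is \[ \chi^2 = \sum_{i,j} \frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(i\text{-th row total})\,(j\text{-th column total})}{N}, \] where \(O_{ij}\) are the observed counts in the table; for a \(3\times 3\) table this has \((3-1)(3-1)=4\) degrees of freedom, as reported above.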

The second result shows a stronger association. Of course, we can also change the size of the training sample:

(setf train (map 'vector (lambda (x) (aref data (random N))) (make-array 70)))
(make-table (map 'list (lambda (x) (cons (read-label x) (classify x train 4))) data))
(((3 . 1) . 8) ((3 . 2) . 25) ((3 . 3) . 15) ((2 . 1) . 7) ((2 . 3) . 21)
 ((2 . 2) . 43) ((1 . 3) . 1) ((1 . 1) . 58))

Presented as a table we get

            New
         1    2    3
    1   58    0    1
Old 2    7   43   21
    3    8   25   15
Finally, let me run the same implementation with a different data set and a different \(k\): this time the Iris dataset, loaded the same way, with \(k=5\).

(make-table (map 'list (lambda (x) (cons (read-label x) (classify x train 5))) data))
(((IRIS-VIRGINICA . IRIS-VERSICOLOR) . 15) ((IRIS-VIRGINICA . IRIS-VIRGINICA) . 35)
 ((IRIS-VERSICOLOR . IRIS-VERSICOLOR) . 50) ((IRIS-SETOSA . IRIS-SETOSA) . 50))

Again presented in table format we see

                   New
               Virginica  Versicolor  Setosa
    Virginica      35         15          0
Old Versicolor      0         50          0
    Setosa          0          0         50

which is on par with the results above.
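One final remark on the distance function: the Wine attributes live on very different scales (the proline values run into the hundreds while several of the acid measurements stay below 1), so the plain Euclidean distance is dominated by the largest-scale attributes. A common fix is to rescale each feature column to \([0,1]\) before classifying. Here is a sketch along those lines (my addition, not part of the original post), assuming the data layout described above:

(defun normalize (data)
   ;; Rescale every feature column of DATA to the interval [0,1],
   ;; leaving the class label in the last entry untouched.  Assumes
   ;; every column contains at least two distinct values.
   (let* ((n (1- (length (aref data 0))))
          (mins (make-array n :initial-element most-positive-double-float))
          (maxs (make-array n :initial-element most-negative-double-float)))
       ;; first pass: record the minimum and maximum of each column
       (loop for x across data
             do (dotimes (i n)
                  (setf (aref mins i) (min (aref mins i) (aref x i))
                        (aref maxs i) (max (aref maxs i) (aref x i)))))
       ;; second pass: rescale each point
       (map 'vector
            (lambda (x)
              (let ((y (copy-seq x)))
                  (dotimes (i n)
                      (setf (aref y i) (/ (- (aref x i) (aref mins i))
                                          (- (aref maxs i) (aref mins i)))))
                  y))
            data)))

Replacing data with (normalize data) before drawing the training sample would then put all attributes on an equal footing.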