k-Nearest
Neighbor Classification Algorithm Implemented in Lisp
Description of the problem
Like other classification problems I implemented before, the setup is
similar: I have a finite set of points from a metric space which I would like to write as a
disjoint union of subsets This
time, I have a training set of examples for which I have a classification scheme such that for each . Our problem is to extend
to a function .
The algorithm
Given a point , we
calculate its distances to each point in the set of training points.
Then we rank the points in the training set from closest to the
farthest. The majority of the classification labels within the first
determines which label the point
is going to get. In short, we ask
the training set which label
should get, and the highest vote among the closest -points wins.
The pseudo-code
Here is the pseudo-code for the algorithm
Function knn
Input: A finite set D of points to be classified
A finite set T of points
A function c: T -> {1,...,m}
A natural number k
Output: A function r: D -> {1,...,m}
Begin
Foreach x in D do
Let U <- {}
Foreach t in T add the pair ( d(x,t) , c(t) ) to U
Sort the pairs in U using the first components
Count the class labels from the first k elements from U
Let r(x) be the class with the highest number of occurence
End Foreach
Return r
End
An implementation in lisp
Let me start by some utility functions. First some code to load the
data
(defun read-data(file)
(with-open-file (stream file)
(let ((result (make-array 0 :fill-pointer t :adjustable t)))
(do ((line (read-line stream nil) (read-line stream nil)))
((null line) result)
(vector-push-extend (map 'vector 'read-from-string (ppcre:split #\, line)) result)))))
READ-DATA
I will start by using The Wine Dataset
from UCI. I will
load in the data and take a small sample of size 45.
(defvar data (read-data "wine.csv"))
(defvar N (length data))
(defvar train (map 'vector (lambda (x) (aref data (random N))) (make-array 45)))
TRAIN
We make the following assumption about the data: the data points are
numerical vectors whose last entries are the class labels. Using this
assumption we define the distance function as follows:
(defun distance (x y)
(sqrt (reduce '+ (map 'vector (lambda (i j k) (expt (- i j) 2)) x y (make-array (1- (length x)))))))
DISTANCE
and we read the class label as follows:
(defun read-label (x) (aref x (1- (length x))))
READ-LABEL
OK. Now let me test a part of the function: given a point calculate
the distances of the point to the training data set and sort the results
from smallest to the largest.
(defun process (x train)
(map 'vector 'car (sort (map 'list (lambda (y) (cons (read-label y) (distance x y))) train)
(lambda (u v) (< (cdr u) (cdr v))))))
PROCESS
And test it on a sample point:
(let ((x (aref data (random N))))
(cons (read-label x) (process x train)))
(1
. #(1 1 1 1 1 1 1 2 1 2 1 1 1 1 3 3 3 3 3 1 3 2 1 3 2 2 2 2 3 2 3 2 3 2 2
3 2 2 1 2 2 2 2 2 2))
What we need now is a function which will find the most frequent
class label from the first entry.
Now comes our classify
function which will determine the
class label of a point.
(defun classify (x train k)
(let (result
(temp (process x train)))
(dotimes (i k)
(if (assoc (aref temp i) result :test 'equal)
(incf (cdr (assoc (aref temp i) result :test 'equal)))
(push (cons (aref temp i) 1) result)))
(caar (sort result (lambda (i j) (> (cdr i) (cdr j)))))))
CLASSIFY
and we test it
(let ((x (aref data (random N))))
(cons (read-label x) (classify x train 3)))
(3 . 1)
which is not successful for this choice.
Now, we apply our function to the whole dataset to assess how
successful it is. For that I will need a comparison of class label given
with the data and the class labels we calculate.
(defun make-table (sent)
(let (result)
(dolist (x sent)
(if (assoc x result :test 'equal)
(incf (cdr (assoc x result :test 'equal)))
(push (cons x 1) result)))
result))
MAKE-TABLE
Here is our final run with =4
(make-table (map 'list (lambda (x) (cons (read-label x) (classify x train 4))) data))
(((3 . 1) . 5) ((3 . 2) . 22) ((3 . 3) . 21) ((2 . 1) . 3) ((2 . 3) . 17)
((2 . 2) . 51) ((1 . 2) . 6) ((1 . 3) . 6) ((1 . 1) . 47))
Presented as a table we get
|
|
1 |
2 |
|
1 |
47 |
6 |
Old |
2 |
3 |
51 |
|
3 |
5 |
22 |
Now, I will repeat the analysis with =8
(make-table (map 'list (lambda (x) (cons (read-label x) (classify x train 8))) data))
(((3 . 3) . 20) ((3 . 2) . 28) ((2 . 1) . 2) ((2 . 3) . 11) ((2 . 2) . 58)
((1 . 3) . 13) ((1 . 1) . 46))
Presented as a table we get
|
|
1 |
2 |
|
1 |
46 |
0 |
Old |
2 |
2 |
58 |
|
3 |
0 |
28 |
Statistical analysis of the
results
If we apply -test to the
tables (and higher correlation between the rows and columns is better)
we see
> chisq.test(matrix(c(47,6,6,3,51,17,5,22,21),nrow=3,ncol=3))
Pearson's Chi-squared test
data: matrix(c(47, 6, 6, 3, 51, 17, 5, 22, 21), nrow = 3, ncol = 3)
X-squared = 108.0064, df = 4, p-value < 2.2e-16
> chisq.test(matrix(c(46,0,13,2,58,11,0,28,20),nrow=3,ncol=3))
Pearson's Chi-squared test
data: matrix(c(46, 0, 13, 2, 58, 11, 0, 28, 20), nrow = 3, ncol = 3)
X-squared = 139.2728, df = 4, p-value < 2.2e-16
The second result shows better corelation. Of course we can change
the number of sample points.
(setf train (map 'vector (lambda (x) (aref data (random N))) (make-array 70)))
(make-table (map 'list (lambda (x) (cons (read-label x) (classify x train 4))) data))
(((3 . 1) . 8) ((3 . 2) . 25) ((3 . 3) . 15) ((2 . 1) . 7) ((2 . 3) . 21)
((2 . 2) . 43) ((1 . 3) . 1) ((1 . 1) . 58))
New
|
Virginica Versicolor Setosa |
Virginica 44 1 0 |
Versicolor 6 49 0 |
t me r |
un another te |
st with a di |
fferent ion the same data set: |
(mak (((I ((I ((I ((I |
e-table (map RIS-VIRGINICA RIS-VIRGINICA
RIS-VERSICOLO RIS-SETOSA . |
’list (lambd . IRIS-VERS . IRIS-VIRG R . IRIS-VER IRIS-SETOSA) |
a (x) (cons (read-label x) (classify x
train 5))) data)) ICOLOR) . 15) INICA) . 35) SICOLOR) . 50) . 50)) |
ain pr |
esented in ta |
ble format w |
e see |
|
|
|
New |
Virginica Versicolor
Virginica 41 9
Old Versicolor 0 50 Setosa 0 0
which is on par with the result above.
Older Posts
[2024-12-24] Number of isomorphism classes
of simple graph (continued)
[2024-12-22] Counting Isomorphism
Classes of Graphs
[2024-11-25] Connected Components
of Graphs
[2024-11-24] Counting connected
components of a graph
[2024-11-18] Counting Isomorphism
Classes of m-ary Trees
[2024-11-16] Number of
Isomorphism Classes of Ternary Trees
[2024-11-12] Hosoya Index of Balanced
Binary Trees
[2024-11-11] Hosoya Index of a
Graph
[2024-10-29] The Clique Number of a Simple
Graph
[2024-10-28] The Size of
Maximally Independent Subsets in a Graph
[2023-11-03] Graph
Algorithms in JGraphT with Common Lisp
[2023-10-28] An
Implementation of Pandas’ cut
and qcut
in
Lisp
[2023-07-24] A
Collatz-like Conjecture for the Projective Line
[2023-03-06] Twin
Primes, Cousin Primes, Sexy Primes, and Prime Triplets
[2023-03-02] Set
of All Partitions of a Finite Set
[2023-02-14] Non-crossing
Partitions and Dyck Words
[2023-02-13] Non-crossing
Linear Chords
[2023-02-04] Clojure/Python
Interop Examples
[2023-01-14] Graph
Algorithms in Clojure with JGraphT
[2022-03-29] 2D-Random Walk
[2022-03-28] Trade
Deficit vs Exchange Rate Curve
[2022-03-16] Working
with World Bank Data in Python
[2022-03-09] Working
with European Central Bank data in python (revisited)
[2022-01-24] A
Clique Analysis of Quakers in early modern Britain (1500-1700)
[2021-12-05] Boyer–Moore
and Misra-Gries Algorithms in Clojure
[2021-09-12] Tension in Text
Plotted
[2021-09-02] Statistical
Distributions using Apache Commons Math in Clojure
[2021-08-31] Reduce
with Intermediate Results in Common Lisp
[2021-08-21] Multivariate
Regression Implemented in Clojure
[2021-05-29] Using
Neural Networks to Detect Graph Properties
[2021-04-17] Fast
Null-Space Calculation via LU-Decomposition
[2021-02-24] Stoer-Wagner
Algorithm in Clojure
[2021-02-19] Calculating
Vertex Covers in Clojure
[2021-02-18] Listing All
Paths in a Graph
[2021-02-14] Strict
Dyck Words and Fibonacci Numbers
[2021-02-14] Kruskal’s
Algorithm in Common Lisp
[2021-02-13] Kruskal's
Algorithm Implemented in Clojure
[2021-02-10] An
integer dynamical system of integers
[2021-02-08] Binary
Symmetrization
[2021-01-28] Prüfer
Encoding and Decoding of a Tree in Clojure
[2021-01-27] Counting
Cycle-Free Paths in a Graph
[2021-01-27] Counting
Connected Labeled Graphs
[2020-12-18] Counting
Graphs with a Prescribed Degree Sequence
[2020-12-13] Havel–Hakimi
Algorithm in Clojure
[2020-12-12] Havel–Hakimi
Algorithm in Common-Lisp
[2020-10-23] The Quadratic
Casimir Element
[2020-07-04] Collatz Sequence
in Binary
[2020-07-02] A Lazy
Sequence of Primes in Clojure
[2020-06-10] Yet
Another Fizz-Buzz in Common Lisp
[2020-05-12] ECB
Data with Clojure and Vega-Lite
[2020-05-06] Processing
ECB Data with Common Lisp
[2020-04-17] Next
Permutation in the Lexicographical Ordering
[2020-04-13] Turkish
Hyphenation in Common Lisp
[2020-04-01] Using JavaPlex
with Clojure
[2019-11-05] Constricted
Arithmetic Progressions
[2019-11-03] The
Number of Arithmetic Progressions of Integers
[2019-05-06] Bron-Kerbosch
Algorithm in Clojure
[2019-05-01] An
Implementation of Ford-Fulkerson Algorithm in Clojure
[2019-04-22] Document
Summarization via Nonnegative Matrix Factorization
[2019-04-20] Latent
Semantic Analysis in Clojure
[2019-04-13] K-Nearest
Neighbors Algorithm in Clojure
[2019-04-06] K-Means
Implemented in Clojure
[2019-03-19] Prüfer
Encoding/Decoding of a Tree in Common Lisp
[2019-03-05] Gale-Shaply
Algorithm in Common Lisp
[2019-03-02] Calculating
The Correct Rank of a Matrix
[2018-12-04] Feed-forward
and back-propagation in neural networks as left- and right-fold
[2018-10-31] Nonnegative
Matrix Decomposition in Clojure
[2018-10-30] Non-negative
Matrix Decomposition in Scala
[2018-08-30] Working
with European Central Bank Data in Scala
[2018-07-30] Perverse
Sequences
[2018-05-28] Online
Perceptron in Common Lisp
[2018-05-26] Online Perceptron
[2018-05-18] Online Regression
[2018-05-06] Knut’s
Algorithm-S in Common Lisp
[2018-02-28] Irreducible Dyck
Words
[2018-02-19] Optimization
with GNU Scientific Library for Lisp
[2018-02-10] Van Eck’s
Sequence
[2018-02-09] Hiring
networks in mathematics
[2018-02-08] Linus Sequence
[2018-02-05] Egyptian
Fractions
[2018-02-01] Listing all
Young Tableaux
[2018-01-23] Collatz sequence
(yet again)
[2018-01-15] Hofstadter's Q
sequence
[2018-01-09] Farey Sequence
[2018-01-09] Catalan's
Triangle
[2018-01-06] The
Shoelace Formula for the Area of a Polygon
[2017-10-01] Working
with European Central Bank Data in Python
[2017-09-27] Expected
Value of the Diameter of a Tree
[2017-09-26] Using
Quandl with kixi.stats on Clojure
[2017-09-22] Using Quandl
with Common Lisp
[2017-08-05] Solving
Linear Equations in Natural Numbers
[2017-07-31] Transitive
Closure of a Directed Graph or a Relation
[2017-07-20] Steenrod-Milnor
and Tournament Sequences
[2017-07-15] A
lower bound on the radius of a graph
[2017-07-08] All partitions
of an integer
[2017-07-06] Some Hasse
Diagrams
[2017-07-04] Shuffles
[2017-07-03] Kaprekar Sequence
[2017-07-01] Lattice of Dyck
Words
[2017-06-28] The
poset of connected subgraphs of a connected graph
[2017-06-21] Calculating
the Diameter and the Radius of a Graph Using Tropic Linear
Algebra
[2017-06-19] Generating
random regular graphs
[2017-06-14] Estimating
the maximum element of a large list
[2017-06-09] A
Stochastic Gradient Descent Implementation in Clojure
[2017-06-06] A topology
problem
[2017-04-22] Listing duplicate
files
[2017-03-14] My First Idris
Proof
[2016-12-02] Distinguishing
hash functions (part II)
[2016-12-01] Distinguishing
hash functions
[2016-10-20] Hofstadter-Conway
$10,000 sequence
[2016-08-18] A
Solution for Problem 171 of 4Clojure
[2016-08-16] Puzzles and Group
Theory
[2016-08-13] Using Weka within
Lisp
[2016-07-12] Funniest
and Unfunniest Jokes in the Jester Dataset
[2016-07-05] Generating
Uniformly Random Connected Graphs
[2016-06-16] The
Robinson-Schensted Algorithm
[2016-06-01] Conjugate
Partitions
[2016-04-27] Using Word2Vec
from Clojure
[2016-04-24] Using
Word2Vec from Common Lisp
[2016-04-18] A Migration
Analysis
[2016-04-11] Basic
Data Analysis with CL without Frameworks
[2016-03-25] Parallel
map-reduce in Common Lisp
[2016-02-22] Text
Summarization and Topic Analysis
[2016-01-27] Set Covering
Problem
[2016-01-25] Kolmogorov-Smirnov
Test
[2016-01-20] Eigen-values
of the Laplacian and Connected Components of a Graph
[2015-12-12] Dual Graphs
[2015-10-26] Longest
Increasing Subsequence Revisited
[2015-10-16] Document
Summarization via Markov Chains
[2015-10-07] Computational
Literary Analysis
[2015-09-30] Library of
Babel in Common Lisp
[2015-09-28] Merging
Association Lists in Common Lisp
[2015-07-22] Cheapest
Paths via Tropic Matrices
[2015-07-21] Hidden
Markov Models via Tropic Matrices
[2015-07-08] A non-technical
post
[2015-06-28] An
implementation of the Viterbi algorithm in Common Lisp
[2015-05-28] Greatest
Common Divisor of Two Rational Numbers
[2015-05-21] Partitions
of Equal Measure Whatever the Measure May Be
[2015-05-14] Finding Cliques
in a Graph
[2015-05-12] Set Cover Problem
[2015-05-03] Threading
Macros in Common Lisp
[2015-05-03] Happy Numbers
[2015-05-01] Collatz Primes
[2015-04-23] Splitting Streams
[2015-04-06] Hamming
Distance and Double Hashing
[2015-04-05] Hamming
Distance and Hashing Functions
[2015-04-05] Hamming
Derivative of Hashing Functions
[2015-04-02] A Topology
Problem
[2015-03-21] Curve
Fitting is a Gram-Schmidt Reduction
[2015-03-08] Maximum
number of characters using keystrokes A, Ctrl+A, Ctrl+C and
Ctrl+V
[2015-03-06] Eccentricity,
Radius and Diameter in a Graph, Revisited
[2015-03-01] Graphs and
Entropy
[2015-02-22] Math PhD
Hiring Network (Part 3)
[2015-02-19] Math PhD
Hiring Network (Part 2)
[2015-02-18] Math PhD
Hiring Network (Part 1)
[2015-02-17] Faculty
Networks and Inequality in Hiring Practices in Universities
[2015-02-10] Functional
Streams in Lisp Explained
[2015-02-05] Collatz-type
Conjectures (Continued)
[2015-02-04] Collatz-type
Conjectures (Continued)
[2015-01-31] Collatz-type
Conjectures (Continued)
[2015-01-30] Collatz-type
Conjectures
[2015-01-28] Experiments
with Infinite Recursive Sequences (continued)
[2015-01-17] Experiments
with Infinite Recursive Sequences
[2015-01-10] Goldbach Pairs
[2015-01-02] Collatz Lengths
(Continued)
[2015-01-01] Functional
Streams
[2014-12-27] Polarization
in the US Congress
[2014-12-18] Partition a
sequence
[2014-11-28] Uniformly
Random Permutations
[2014-11-22] An
Implementation of Ford-Fulkerson Algorithm in Common Lisp
[2014-11-17] Tropic
Calculation of Cheapest Paths
[2014-11-05] Longest
common subsequence of two sequences
[2014-10-30] Counting
Spanning Trees of a Graph
[2014-10-26] Longest
Increasing Subsequence
[2014-10-24] The
Number of Inversions in a Sequence
[2014-10-22] Hashes and
Entropy
[2014-10-09] Estimating
Cardinality with Constant Memory Complexity
[2014-09-30] Landau's Function
[2014-09-29] A
Problem on Substitution Ciphers and Group Theory
[2014-09-28] A Morse Code
Translator
[2014-09-23] A
Memoization Macro for Common Lisp
[2014-09-21] Reducers are
Monoid Morphisms
[2014-09-18] Number
of isomorphism classes of binary trees
[2014-09-07] CONS is your
friend
[2014-08-22] A Zipf's Law
Simulation
[2014-08-07] Generating
Uniformly Random Trees
[2014-07-09] A Solution
for Project Euler #463
[2014-06-12] Entropy of
truncated MD5 hashing
[2014-06-08] Hexadecimal digits
of π
[2014-02-11] Information
content of n-grams
[2014-02-08] Turkish
Sentiment Analysis Using Thesaurus Distance
[2014-02-01] Sentiment
analysis using word distances
[2014-01-27] Phase
transitions in entropy
[2013-12-13] Optimal length of
n-grams
[2013-12-10] Counting
strings that contain intervals of same letter repetitions
[2013-12-02] Patterns
Separating Large Texts
[2013-11-23] Collatz
Sequences (Continued)
[2013-11-11] Entropy
and approximately one-to-one maps
[2013-10-23] Tree Isomorphism
[2013-10-15] Self Organizing
Maps
[2013-09-15] Euler Project
#401
[2013-09-15] An
additively recursive definition of the Moebius function
[2013-09-11] An
Unsuccessful Attempt for Solving Euler Project #401
[2013-09-04] Uniform
Sampling from Parametrized Submanifolds in Scala
[2013-09-04] Uniform
Sampling from Parametrized Submanifolds
[2013-08-30] Randomly
Generated Points Obeying a Distribution
[2013-08-25] Simulated
Annealing in Lisp
[2013-08-21] Eigenvalues
and Eigenvectors in GSLL
[2013-08-16] Reservoir
Sampling
[2013-08-11] Gibbs
sampling in lisp compared with C
[2013-08-10] Logistic
Regression in lisp
[2013-08-10] Linear
Discriminant Analysis in R
[2013-07-17] A
Gradient Descent Implementation in Lisp
[2013-07-01] k-Nearest
Neighbor Classification Algorithm Implemented in Lisp
[2013-05-19] Newton-Raphson
Method
[2013-05-07] Levenshtein
Distance
[2013-04-15] Cut points in a
graph
[2013-04-01] Experiments
on Collatz Lengths (Continued)
[2013-02-18] The
sound of the torsion parts of homotopy groups of spheres
[2013-02-12] Monadic Units
[2013-02-07] Distribution
of Collatz Lengths (continued)
[2013-02-03] Distribution
of Collatz Lengths
[2013-01-31] Quotients
of polynomial algebras
[2013-01-12] Path ideals
[2013-01-10] McCarthy91
Terminates
[2013-01-09] Finding
all paths in a directed graph
[2013-01-04] A
Simple Monte-Carlo Integration Implementation in Lisp
[2012-12-30] A
simple problem in Kolmogorov-Chaitin complexity
[2012-12-29] From walks to
paths
[2012-12-16] Higher
order functions, functors and monads
[2012-12-13] Eccentricity,
Radius and Diameter in an Undirected Graph
[2012-11-29] Untitled
[2012-11-25] Strictly
Increasing Labels of Directed Graphs
[2012-11-19] Strictly
Increasing Labellings of Directed Graphs
[2012-11-17] Nilpotent
elements in an artinian algebra
[2012-11-04] Local
rings, idempotents and non-invertible elements
[2012-10-18] An
implementation of the fixed-radius near neighbor clustering algorithm in
lisp
[2012-10-15] Reducing directed
graphs
[2012-10-10] An
implementation of the k-means
clustering algorithm in lisp
[2012-10-08] A
comparison of different map functions in lisp
[2012-10-03] Source code
entropy
[2012-09-28] Collisions in
random walks
[2012-09-26] Transitive
closure of a directed graph
[2012-09-26] Solving
linear equations in ℕ
[2012-09-26] Listing
partitions
[2012-09-26] Inverting
formal power series
[2012-09-26] Hasse
subgraph of a directed graph