This is the second post of my going back to basics phase. For today’s post I am going to implement k-nearest neighbors algorithm in clojure.
The set up is as follows: we have a cloud of data points embedded in an affine space. This just means that the points have numerical coordinates. Along with their coordinates, we also assign labels to each of these points. Labels could be anything: numbers, strings, letters etc. Our aim to come up with a classification scheme in which we produce a label when we are presented with a point whose label is not known. The k-nearest neighbors algorithm does the simplest thing: given a point \(x\), we find k-nearest points to \(x\) whose labels are known. Then we take a vote among these points. Largest vote wins and we assign the point \(x\) with that label.
First we need to define a distance function. Since I am going to use the distance function for comparisons, I do not need the actual value. I am going to use the Euclidean distance function but without the square root.
defn distance [xs ys]
(reduce + (map #(let [u (- %1 %2)] (* u u)) xs ys))) (
#'user/distance
Next, the implementation:
defn k-nn [x k data]
(let [cmp (fn [u v] (< (distance x u) (distance x v)))
(fn [ys y] (take k (sort cmp (cons y ys))))]
rdr (->> (keys data)
(reduce rdr [])
(map data)
(frequencies
apply max-key val)
(first)))
#'user/k-nn
The value k
is the number of neighbors of the point
x
we are going to use. Since we need to take a vote among
these k
points, k
must be an odd number. I am
going to assume data
is a map where the keys are the
coordinates of the points and values are the labels. Here is a small
example:
def data {[0 0] :a [1 0] :a [0 1] :a [1 1] :a [1 2] :a [-1 0] :a
(4 5] :b [5 4] :b [5 5] :b [5 6] :b [4 4] :b}) [
The local function cmp
compares distances to a given
point x
, and rdr
is a reducer function that
computes k-nearest points to x
.
Now, let us test:
1 2] 3 data) (k-nn [-
:a
I am going to recycle the code I wrote for the k-means code I wrote few days ago.
defn random-vector [xs sigma]
(map #(+ (- (rand (* 2 sigma)) sigma) %) xs))
(
def data (concat (repeatedly 6500 (fn [] {(random-vector (repeat 10 0.0) 0.2) :a}))
(repeatedly 7750 (fn [] {(random-vector (repeat 10 -1.0) 0.75) :b}))
(repeatedly 8000 (fn [] {(random-vector (repeat 10 2.0) 0.5) :c}))))
(
repeat 10 1.5) 1.0) 7 (into {} data)) (k-nn (random-vector (
#'user/random-vector
#'user/data
:c