The Kitchen Sink and Other Oddities

Atabey Kaygun

K-Nearest Neighbors Algorithm in Clojure

Description of the problem

This is the second post of my going back to basics phase. For today’s post I am going to implement k-nearest neighbors algorithm in clojure.

The set up is as follows: we have a cloud of data points embedded in an affine space. This just means that the points have numerical coordinates. Along with their coordinates, we also assign labels to each of these points. Labels could be anything: numbers, strings, letters etc. Our aim to come up with a classification scheme in which we produce a label when we are presented with a point whose label is not known. The k-nearest neighbors algorithm does the simplest thing: given a point \(x\), we find k-nearest points to \(x\) whose labels are known. Then we take a vote among these points. Largest vote wins and we assign the point \(x\) with that label.

An implementation

First we need to define a distance function. Since I am going to use the distance function for comparisons, I do not need the actual value. I am going to use the Euclidean distance function but without the square root.

(defn distance [xs ys]
   (reduce + (map #(let [u (- %1 %2)] (* u u)) xs ys)))
#'user/distance

Next, the implementation:

(defn k-nn [x k data]
   (let [cmp (fn [u v] (< (distance x u) (distance x v)))
         rdr (fn [ys y] (take k (sort cmp (cons y ys))))]
      (->> (keys data)
           (reduce rdr [])
           (map data)
           frequencies
           (apply max-key val)
           first)))
#'user/k-nn

The value k is the number of neighbors of the point x we are going to use. Since we need to take a vote among these k points, k must be an odd number. I am going to assume data is a map where the keys are the coordinates of the points and values are the labels. Here is a small example:

(def data {[0 0] :a [1 0] :a [0 1] :a [1 1] :a [1 2] :a [-1 0] :a 
           [4 5] :b [5 4] :b [5 5] :b [5 6] :b [4 4] :b})

The local function cmp compares distances to a given point x, and rdr is a reducer function that computes k-nearest points to x.

Now, let us test:

(k-nn [-1 2] 3 data)
:a

A larger example

I am going to recycle the code I wrote for the k-means code I wrote few days ago.

(defn random-vector [xs sigma]
   (map #(+ (- (rand (* 2 sigma)) sigma) %) xs))

(def data (concat (repeatedly 6500 (fn [] {(random-vector (repeat 10  0.0) 0.2) :a}))
                  (repeatedly 7750 (fn [] {(random-vector (repeat 10 -1.0) 0.75) :b}))
                  (repeatedly 8000 (fn [] {(random-vector (repeat 10  2.0) 0.5) :c}))))

(k-nn (random-vector (repeat 10 1.5) 1.0) 7 (into {} data))
#'user/random-vector
#'user/data
:c