The Kitchen Sink and Other Oddities

Atabey Kaygun

Multivariate Regression Implemented in Clojure

Description of the problem

Today, I am going to write an implementation of multivariate regression in clojure with all statistical bells and whistles.

Dependencies

Here are the dependencies I have for this implementation:

:deps { net.mikera/core.matrix {:mvn/version "0.62.0"}
        net.mikera/vectorz-clj {:mvn/version "0.48.0"}
        org.apache.commons/commons-math3 {:mvn/version "3.5"}
}

Namespace

With the dependencies above, I will put everything I need into my namespace.

(ns regression
  (:use clojure.core.matrix)
  (:use clojure.core.matrix.operators)
  (:import [org.apache.commons.math3.distribution
           NormalDistribution
           TDistribution
           FDistribution]))

Implementation

Here is my implementation. The first term in the coefficients is the intercept.

(defn regression [xs y]
  (let [[k n] (shape xs)
        nk (- n k 1)
        tdist (TDistribution. nk)
        fdist (FDistribution. k nk)
        sig (fn [dist x] (* 2.0 (- 1.0 (.cumulativeProbability dist x))))
        X (join [(repeat n 1.0)] xs)
        XXt (mmul X (transpose X))
        beta (-> XXt inverse (mmul X y))
        yhat (mmul beta X)
        res (sub y yhat)
        ybar (/ (reduce + y) n)
        y-cen (sub y (repeat n ybar))
        yhat-cen (sub yhat (repeat n ybar))
        SSR (dot yhat-cen yhat-cen)
        SST (dot y-cen y-cen)
        SSRes (dot res res)
        se (map sqrt (scale (/ SSRes n) (diagonal XXt)))
        tscore (abs (map / beta se))
        F (/ (/ SSR k) (/ SSRes nk))
        R2 (/ SSR SST)]
    {:coeffs (seq beta)
     :se (seq se)
     :t0 (seq tscore)
     :tsig (map #(sig tdist %) tscore)
     :F F
     :Fsig (sig fdist F)
     :RSquare R2}))
#'regression/regression

Test

Let us test it on random data. The coefficients are -2 and 1 with intercept being 4.

(let [normal (NormalDistribution. 4.0 0.2)
      random-vector (fn [n] (take n (repeatedly rand)))
      X (array (join [(random-vector 10)]
                     [(random-vector 10)]))
      Y (add (mmul [-2 1] X)
             (seq (.sample normal 10)))]
  (regression X Y))
{:coeffs (4.0169543977679725 -1.9653087520525703 0.8539922022675653), :se (0.5189380093158233 0.30494977845244536 0.3768881235684564), :t0 (7.740721098969014 6.444696441578315 2.2659037227859207), :tsig (1.1242165296532214E-4 3.520298919033049E-4 0.05782572705551492), :F 45.39319256356398, :Fsig 1.9629026632062008E-4, :RSquare 0.9284153924812029}