The Kitchen Sink and Other Oddities

Atabey Kaygun

Kolmogorov-Smirnov Test

Description of the problem

In doing data analysis, sometimes one needs to test if two data sets are significantly different. Put it in a statistically verifiable statement: how do we check if two data sets viewed as random variables come from different distributions? One possible way of doing this is to employ Kolmogorov-Smirnov test.

Implementation

I am not going to try to implement my own version of the Kolmogorov distribution. I am going to use the C implementation written by Richard Simard and Pierre L’Ècuyer.

Download the C code together with the header file, and compile it using

gcc -fpic -shared KolmogorovSmirnovDist.c -lm -o ks.o

Let us define our own package in order not to pollute the global namespace.

(defpackage stats
   (:use common-lisp gsll cffi)
   (:import-from cffi load-foreign-library defcfun)
   (:import-from gsll make-random-number-generator sample))
#<PACKAGE "STATS">

(in-package :stats)
#<PACKAGE "STATS">

Now, we are ready to load the function KScdf from this library using cffi.

(load-foreign-library 
    "/home/kaygun/src/lisp/kolmogorov-smirnov/ks.so")
#<FOREIGN-LIBRARY KS.SO-664 "ks.so">

(defcfun ("KScdf" ks) :double (n :int) (x :double))
KS

I am going to need to get an empirical cumulative distribution function out of a given data set.

(defun cdf (xs)
  (let ((n (1- (length xs))))
    (lambda (x)
      (labels ((fn (a b)
                 (let ((m (truncate (/ (+ b a) 2))))
                   (cond
                     ((or (<= x (elt xs a))
                          (<= (- b a) 1))
                      (/ a n 1.0d0))
                     ((>= x (elt xs b))
                      (/ b n 1.0d0))
                     ((> (elt xs m) x) (fn a m))
                     (t (fn m b))))))
        (fn 0 n)))))
CDF

And, finally the test code:

(defun ks-test (xs ys)
  (multiple-value-bind (as bs)
      (if (>= (length xs) (length ys))
          (values xs ys)
          (values ys xs))
    (let* ((n (1- (length bs)))
           (fn (cdf (sort as #'<)))
           (cs (sort bs #'<))
           (d (loop for i from 0 below (length cs) maximize
                   (abs (- (funcall fn (elt cs i))
                           (/ i n 1d0))))))
      (cons d (- 1.0d0 (ks n d))))))
KS-TEST

Let us see how this works:

(let* ((n 200)
       (rng (make-random-number-generator +mt19937+))
       (xs (loop repeat n collect
               (sample rng :gaussian :sigma 1d0)))
       (ys (loop repeat (* 3 n) collect
               (sample rng :gaussian :sigma 1d0)))
       (zs (loop repeat (* 5 n) collect
               (+ 8d-1 (sample rng :gaussian :sigma 1.25d0)))))
    (list (ks-test xs ys)
          (ks-test xs zs)))
((0.07133329418377363d0 . 0.2511688455701715d0)
 (0.2042042042042042d0 . 1.8368679837110768d-4))

Analysis

Since p-value is about 0.25, the statistics we calculated for the first two datasets is not significant. We can’t say xs and ys are different. On the other hand, with statistical certainity we can say that xs and zs are different.