In doing data analysis, sometimes one needs to test if two data sets are significantly different. Put it in a statistically verifiable statement: how do we check if two data sets viewed as random variables come from different distributions? One possible way of doing this is to employ Kolmogorov-Smirnov test.
I am not going to try to implement my own version of the Kolmogorov distribution. I am going to use the C implementation written by Richard Simard and Pierre L’Ècuyer.
Download the C code together with the header file, and compile it using
gcc -fpic -shared KolmogorovSmirnovDist.c -lm -o ks.o
Let us define our own package in order not to pollute the global namespace.
(defpackage stats
(:use common-lisp gsll cffi)
(:import-from cffi load-foreign-library defcfun)
(:import-from gsll make-random-number-generator sample))
#<PACKAGE "STATS">
(in-package :stats)
#<PACKAGE "STATS">
Now, we are ready to load the function KScdf
from this
library using cffi.
(load-foreign-library
"/home/kaygun/src/lisp/kolmogorov-smirnov/ks.so")
#<FOREIGN-LIBRARY KS.SO-664 "ks.so">
(defcfun ("KScdf" ks) :double (n :int) (x :double))
KS
I am going to need to get an empirical cumulative distribution function out of a given data set.
(defun cdf (xs)
(let ((n (1- (length xs))))
(lambda (x)
(labels ((fn (a b)
(let ((m (truncate (/ (+ b a) 2))))
(cond
((or (<= x (elt xs a))
(<= (- b a) 1))
(/ a n 1.0d0))
((>= x (elt xs b))
(/ b n 1.0d0))
((> (elt xs m) x) (fn a m))
(t (fn m b))))))
(fn 0 n)))))
CDF
And, finally the test code:
(defun ks-test (xs ys)
(multiple-value-bind (as bs)
(if (>= (length xs) (length ys))
(values xs ys)
(values ys xs))
(let* ((n (1- (length bs)))
(fn (cdf (sort as #'<)))
(cs (sort bs #'<))
(d (loop for i from 0 below (length cs) maximize
(abs (- (funcall fn (elt cs i))
(/ i n 1d0))))))
(cons d (- 1.0d0 (ks n d))))))
KS-TEST
Let us see how this works:
(let* ((n 200)
(rng (make-random-number-generator +mt19937+))
(xs (loop repeat n collect
(sample rng :gaussian :sigma 1d0)))
(ys (loop repeat (* 3 n) collect
(sample rng :gaussian :sigma 1d0)))
(zs (loop repeat (* 5 n) collect
(+ 8d-1 (sample rng :gaussian :sigma 1.25d0)))))
(list (ks-test xs ys)
(ks-test xs zs)))
((0.07133329418377363d0 . 0.2511688455701715d0)
(0.2042042042042042d0 . 1.8368679837110768d-4))
Since p-value is about 0.25, the statistics we calculated for the
first two datasets is not significant. We can’t say xs
and
ys
are different. On the other hand, with statistical
certainity we can say that xs
and zs
are
different.