The Kitchen Sink and Other Oddities

Atabey Kaygun

Online Regression

Description of the problem

Assume that we have a functional relation between two variables of the form

\[ y \sim f(\alpha\cdot x + \beta) + \epsilon \]

where \(\epsilon\) is our error term, \(f\) is a known function, and \(x\) is a numerical vector. The parameters \(\alpha\) and \(\beta\) are unknown. Classical numerical multivariate regression uses \(f(x)=x\). If we have a collection of data points \(\{(y^{(i)},x^{(i)})\}_{i\in I}\), can we find the best approximating \(\alpha\) and \(\beta\)?

Let me add another constraint: assume we do not have access to the whole data set, and we are allowed to make only one pass. Can we design an online algorithm to solve this problem?

A stochastic solution

Assume we have a data point \((y^{(i)},x^{(i)})\) and that we already have an estimate \((\alpha^{(i)},\beta^{(i)})\). The error in our computation is

\[ \Delta y^{(i)} = y^{(i)} - f(\alpha^{(i)}\cdot x^{(i)} + \beta^{(i)}) \]

so that we have

\[ y^{(i)} - \Delta y^{(i)} = f(\alpha^{(i)}\cdot x^{(i)} +\beta^{(i)}) \]

Assume for now that \(f\) is invertible with inverse \(g\). Applying \(g\) to both sides gives \(g(y^{(i)} - \Delta y^{(i)}) = \alpha^{(i)}\cdot x^{(i)} + \beta^{(i)}\), and a first-order Taylor expansion of \(g\) around \(y^{(i)}\) then yields

\[ g(y^{(i)}) \approx \alpha^{(i)}\cdot x^{(i)} + \beta^{(i)} + g'(y^{(i)})\,\Delta y^{(i)} \]

On the other hand, the derivative of the inverse function can be calculated from \(f\): to first order, \(g'(y^{(i)}) = 1/f'(g(y^{(i)})) \approx 1/f'(\alpha^{(i)}\cdot x^{(i)} + \beta^{(i)})\). Substituting, and taking the result as the point our updated parameters should map to, we get

\[ g(y^{(i)}) \approx \alpha^{(i)}\cdot x^{(i)} + \beta^{(i)} + \frac{\Delta y^{(i)}}{f'(\alpha^{(i)}\cdot x^{(i)} + \beta^{(i)})} = \alpha^{(i+1)}\cdot x^{(i)} + \beta^{(i+1)} \]

This means we get update rules of the form

\[ \beta^{(i+1)} \leftarrow \beta^{(i)} + \frac{\eta\,\Delta y^{(i)}}{f'(\alpha^{(i)}\cdot x^{(i)} + \beta^{(i)})} \]

and

\[ \alpha^{(i+1)} \leftarrow \alpha^{(i)} + \frac{\eta\,\Delta y^{(i)}}{f'(\alpha^{(i)}\cdot x^{(i)} + \beta^{(i)})}\, x^{(i)} \]

where \(\eta\) is the learning rate. Since the solution is iterative, we can process the data one point at a time without keeping the whole dataset in memory.
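
As a sanity check, in the classical case \(f(x)=x\) we have \(f'\equiv 1\), and the update rules reduce to the familiar least-mean-squares (Widrow-Hoff) rule:

\[ \alpha^{(i+1)} \leftarrow \alpha^{(i)} + \eta\,\Delta y^{(i)}\, x^{(i)}, \qquad \beta^{(i+1)} \leftarrow \beta^{(i)} + \eta\,\Delta y^{(i)} \]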

An implementation

First, I am going to need two utility functions: the first is an implementation of the ordinary dot product of two vectors; the second is a numerical differentiation of a function at a given point.

;; Ordinary dot product of two vectors.
(defun dot (xs ys) (reduce #'+ (map 'list #'* xs ys)))

;; Central-difference approximation to the derivative of fn at x.
(defun diff (fn x &optional (epsilon 1d-2))
  (let ((delta (/ epsilon 2)))
    (* (- (funcall fn (+ x delta))
          (funcall fn (- x delta)))
       (/ epsilon))))
DOT
DIFF
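
For instance (return values shown in comments; the derivative is only a central-difference approximation):

(dot '(1.0 2.0 3.0) '(4.0 5.0 6.0)) ;; => 32.0
(diff #'sin 0.0)                    ;; => ~1.0, approximating cos(0)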

For the regression function, I am going to assume the dataset is a stream, i.e., it is given by a function (with internal state) that takes no arguments and returns a new data point each time it is called, until it exhausts all points and starts returning nil.
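
For example, here is a toy stream (the name and the two data points are made up for illustration) that yields points of the form (y . features) and then nil:

(let ((state (list (cons 2.0 '(1.0))
                   (cons 4.0 '(2.0)))))
  (defun toy-stream ()
    (pop state)))

;; (toy-stream) => (2.0 1.0)
;; (toy-stream) => (4.0 2.0)
;; (toy-stream) => NIL

The regression function below reads such a stream one point at a time: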

(defun regression (data fn &optional (eta 1d-2))
  ;; The first point from the stream is consumed only to size alpha:
  ;; one weight per feature plus one for the intercept.
  (let ((alpha (loop repeat (length (funcall data))
                     collect (- 1.0 (random 2.0)))))
    (do ((point (funcall data) (funcall data)))
        ((null point) alpha)
      (let* ((x (cons 1.0 (cdr point)))              ; prepend 1.0 for the intercept
             (y (car point))
             (calculated (dot alpha x))
             (derivative (diff fn calculated))
             (delta (- (funcall fn calculated) y))   ; the negative of the Delta y above
             (update (/ delta derivative)))
        ;; Since delta is the negative of Delta y, decf implements the
        ;; update alpha <- alpha + eta * Delta y / f'(alpha.x + beta) * x.
        (loop for i from 0 below (length alpha)
              do (decf (elt alpha i) (* eta update (elt x i))))))))
REGRESSION

The following function creates an experiment dataset:

(defun generate-data (n m a b fn delta)
  (let ((res nil)
        (len (- b a)))
    (dotimes (i n res)
      (let* ((x (loop repeat m collect (+ (random len) a))) ; m features, uniform in [a,b)
             (y (+ (funcall fn x)
                   (- (random delta) (/ delta 2.0)))))      ; fn(x) plus uniform noise
        (push (cons y x) res)))))
GENERATE-DATA

Here \(n\) stands for the number of points created, \(m\) for the number of features, \(a\) and \(b\) are the bounds for the features, \(fn\) is our function, and \(\delta\) is the width of the (uniform) error term.
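
For instance, a call like the following (the exact values are random and will vary from run to run) produces two points, each a cons of \(y\) and a list of three features:

(generate-data 2 3 0.0 1.0 (lambda (x) (reduce #'+ x)) 0.1)
;; e.g. => ((1.4721 0.3102 0.6244 0.5402) (0.9836 0.1120 0.4507 0.4251))

However, the dataset is a list. We must turn this list into a stream: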

(defun streamify (data)
  ;; Wrap a list in a closure that returns one element per call, then nil.
  (let ((xs (copy-list data)))
    (lambda ()
      (let ((x (car xs)))
        (setf xs (cdr xs))
        x))))
STREAMIFY
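
For example (a quick illustration, with the return value shown as a comment):

(let ((s (streamify '(1 2 3))))
  (list (funcall s) (funcall s) (funcall s) (funcall s)))
;; => (1 2 3 NIL)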

Now, we are ready to write the experiment code:

(defun experiment (param a b n fn eta delta)
  (let* ((data (generate-data n (1- (length param)) a b
                              (lambda (x) (funcall fn (dot param (cons 1.0 x))))
                              delta))
         (xs (streamify data))
         (alpha (regression xs fn eta))
         (n 0)
         (r-square 0))
    ;; Despite its name, r-square accumulates the residual sum of squares.
    (dolist (point data)
      (let* ((x (cdr point))
             (y (car point))
             (z (funcall fn (dot alpha (cons 1.0 x)))))
        (incf n)
        (incf r-square (expt (- z y) 2))))
    (format nil "param = ~{~4,4f~^, ~}~%alpha = ~{~4,4f~^, ~}~%R2=~4,4f N=~d~%" param alpha r-square n)))
EXPERIMENT

Our first experiment is a standard regression:

(experiment '(1.0 -1.0 2.0 3.2 -1.2 0.8) -3.0 2.0 250000 (lambda (x) x) 0.025 0.05)
param = 1.0000, -1.0000, 2.0000, 3.2000, -1.2000, .8000
alpha = 1.0003, -.9984, 1.9983, 3.1977, -1.2001, .8003
R2=58.1951 N=250000

Pretty good. Now, with the logistic function:

(defun logit (x) (/ 1.0 (+ 1.0 (exp (- x))))) ; the logistic (sigmoid) function
LOGIT

(experiment '(1.2 -1.8 2.1 3.4) 0.0 1.0 12000 #'logit 0.08 0.03)
param = 1.2000, -1.8000, 2.1000, 3.4000
alpha = 1.0552, -1.6287, 2.0517, 3.0794
R2=2.8681 N=12000

Still, pretty good.