The Kitchen Sink and Other Oddities

Atabey Kaygun

A Gradient Descent Implementation in Lisp

Description of the problem

Assume we have a function of the form \(f\colon \mathbb{R}^n\to \mathbb{R}\). We would like to find a local extremum point of this function iteratively, starting from an initial point. I will use the method of gradient descent. In its iterative form it says that the recursive sequence \[ x_{n+1} = x_n - \gamma \nabla f(x_n) \] will converge to a local extremum point provided that \(f\) is well-behaved and the initial choice \(x_0\) is close enough to that extremum. Here, our parameters are the initial point \(x_0\) and an iteration parameter \(\gamma\), the step size.
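
For a quick illustration of the recursion (my addition): for \(f(x)=x^2\) and \(\gamma=0.1\) the iteration reads \(x_{n+1} = x_n - 0.1\cdot 2x_n = 0.8\,x_n\), which shrinks geometrically toward the minimum at \(x=0\) from any starting point.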

A toy example

Let me implement a toy example. Let us find a local extremum of a function \(f(x)\) of a single real variable \(x\). I don't have to work very hard for this case: I could write a simple lisp function which returns the numerical derivative \(f'(x)\) of the given function \(f(x)\) and then plug that derivative into the Newton-Raphson algorithm to find out where the local extrema are. But today I am going to follow a different path.

First, I need a function which calculates the derivative of a function (as a number) at a point, using the symmetric difference quotient \((f(x+\epsilon)-f(x-\epsilon))/2\epsilon\).

(defun diff (f x &optional (epsilon 1.0d-4))
   "Numerical derivative of F at X, via the symmetric difference quotient."
   (/ (- (funcall f (+ x epsilon)) 
         (funcall f (- x epsilon))) 
      (* 2 epsilon)))
DIFF

Let me test this function. We all know that the derivative of \(\sin(x)\) is \(\cos(x)\). I will compare these functions on a random number between \(0\) and \(\pi\):

(let ((x (random (* 4 (atan 1))))) (cons (cos x) (diff 'sin x)))
(-0.39555392 . -0.39555389353740367d0)

Now, my implementation of the gradient descent for the toy case of a function of a single real variable:

(defun descent1 (fun x &key (error 1.0d-4) (rate 1.0d-2) (max-steps 10000))
   "Gradient descent for a function FUN of one real variable, starting at X.
Returns NIL when no extremum is found within MAX-STEPS iterations."
   (do* ((y x (- y step))
         (step (* rate (diff fun y)) (* rate (diff fun y)))
         (n 0 (1+ n)))
        ((or (< (abs step) error) (>= n max-steps)) (if (< n max-steps) y))))
DESCENT1

I designed the function in such a way that if it can't find a solution within the first 10000 steps (or whatever limit is passed through the max-steps argument) it will return nil.
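
For example (a quick check of my own, not in the original post), a function whose derivative never vanishes keeps a constant step, so the step budget runs out and we get nil:

(descent1 (lambda (x) x) 0.0)
NIL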

The function \(f(x)=x(1-x)\) has a local maximum at \(x=0.5\). Since gradient descent travels downhill, I will pass its negative \(-f(x)=x(x-1)\), which has a local minimum at the same point. So, let me see if we can find this local extremum using the function descent1 we defined above:

(descent1 (lambda (x) (* x (1- x))) 1.1 :error 1.0d-6 :rate 3.5d-2)
0.5000139683370232d0

The general case

First, I will need a function which takes a vector input \(x\) and another vector \(v\) giving a direction, and calculates the directional derivative \(\nabla_v f\vert_x\) of \(f\) along \(v\) at \(x\).
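
Spelling out what the code below computes: it is the two-sided approximation \[ \nabla_v f\vert_x \approx \frac{f(x+\epsilon v) - f(x-\epsilon v)}{2\,\epsilon\|v\|}, \] which agrees with the one-variable diff above when \(v\) is a unit vector.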

(defun norm (x)
   "Euclidean norm of the vector X."
   (sqrt (reduce '+ (map 'vector (lambda (i) (expt i 2)) x))))

(defun nabla (f x v &optional (epsilon 1.0d-4))
   "Directional derivative of F along the vector V at the point X."
   (let* ((dir (map 'vector (lambda (u) (* epsilon u)) v))
          (ter (map 'vector '+ x dir))
          (ini (map 'vector '- x dir)))
      (/ (- (funcall f ter) 
            (funcall f ini))
         (* 2 (norm dir)))))
NABLA

Let me take \(f(x,y) = x^2 + y^2\) and find its partial derivatives at \((0,0)\).

(defun f (v)
   (reduce '+ (map 'vector (lambda (i) (expt i 2)) v)))

(cons (nabla 'f #(0 0) #(0 1)) (nabla 'f #(0 0) #(1 0)))
(0.0d0 . 0.0d0)
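
Both partial derivatives vanish at the origin, as they should. As one more check (my addition), at the point \((1,0)\) the partial derivative along the \(x\)-direction is \(2x = 2\):

(nabla 'f #(1 0) #(1 0)) ; should return approximately 2.0d0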

Before implementing the general gradient descent, I will need some utility functions:

(defun range (n &optional (ini 0) (step 1))
   "List of numbers from INI (inclusive) up to N (exclusive) by STEP."
   (loop for i from ini below n by step collect i))

(defun basis (i n)
   "The I-th standard basis vector of length N."
   (map 'vector (lambda (j) (if (equal i j) 1 0)) (range n)))
BASIS

Let me test these:

(mapcar (lambda (i) (norm (basis i 10))) (range 10))
(1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0)
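
And a basis vector itself, just to see the shape (my addition):

(basis 2 4)
#(0 0 1 0)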

Now, the full implementation of the gradient descent.

(defun descent (fun x &key (error 1.0d-5) 
                           (rate 1.0d-2) 
                           (max-steps 10000))
  "Gradient descent for FUN starting at the vector X.
Returns NIL when no extremum is found within MAX-STEPS iterations."
  (let ((len (length x)))
     (flet ((gradient-step (y)
              ;; RATE times the numerical gradient of FUN at Y,
              ;; assembled one coordinate direction at a time.
              (map 'vector 
                   (lambda (i) (* rate (nabla fun y (basis i len)))) 
                   (range len))))
        (do* ((y x (map 'vector '- y step))
              (step (gradient-step y) (gradient-step y))
              (n 0 (1+ n)))
             ((or (< (norm step) error) 
                  (>= n max-steps)) 
              (if (< n max-steps) y))))))
DESCENT

We defined \(f(x,y) = x^2 + y^2\) earlier. Let me test the function descent on \(f(x,y)\):

(let ((result (descent 'f #(1.0 1.0) :rate 3.3d-2)))
  (format nil "~%    We found a local extremum at x=~3,3F y=~3,3F~%" (aref result 0) (aref result 1)))

We found a local extremum at x=.000 y=.000

Let me test a more complicated function: \(g(x,y,z) = \sin(x/2)\cos(y) + z^2\).

(defun g (x) 
   (+ (* (sin (* 0.5 (aref x 0))) 
         (cos (aref x 1))) 
      (expt (aref x 2) 2)))

(let ((result (descent 'g #(-1.2 -0.2 0.9) :rate 2.3d-2 :error 1.d-6)))
  (format nil "~%    We found a local extremum at x=~3,3F y=~3,3F z=~3,3F~%" 
              (aref result 0) 
              (aref result 1) 
              (aref result 2)))

We found a local extremum at x=-3.141 y=-.000 z=.000
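
The point \((-\pi, 0, 0)\) is indeed a local minimum of \(g\). As a final check (my addition), the value of \(g\) there should be close to \(\sin(-\pi/2)\cos(0) + 0 = -1\):

(g #(-3.141 0.0 0.0)) ; should return approximately -1.0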

An application: fitting a regression line via gradient descent

For simplicity, assume we have a collection of points \((x_1,y_1),\ldots,(x_n,y_n)\) on the xy-plane and we want to fit an approximating line to this dataset. We will write an error function \[ E(a,b) = \frac{1}{2n}\sum_{i=1}^n (a x_i + b - y_i)^2 \] and then minimize this function using gradient descent.
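
Spelling out the gradient (this short derivation is my addition; the post approximates it numerically): \[ \frac{\partial E}{\partial a} = \frac{1}{n}\sum_{i=1}^n (a x_i + b - y_i)\,x_i, \qquad \frac{\partial E}{\partial b} = \frac{1}{n}\sum_{i=1}^n (a x_i + b - y_i), \] so gradient descent repeatedly nudges \((a,b)\) against these two sums.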
I need a suitably random dataset.

(defvar a (- (random 8.6) 4.3))
(defvar b (- (random 6.8) 3.4))

(defvar mydata
   (map 'vector 
        (lambda (x) (coerce (list x (+ b (* a x) (- (random 0.8) 0.4))) 
                            'vector))
        (loop for i from 1 to 520 collect (random 2.0))))
MYDATA

First, I chose a random slope and a random intercept, then I generated a set of 520 points with \(x\) drawn from the interval \([0,2)\) and \(y\) calculated as \(ax+b\) plus a small error term. Now, let me create the error function from this dataset. For this purpose, I will write a lisp function that returns my error function for a given dataset. In order to construct the error function I will take only a sample of the dataset. The sampling percentage can be passed as an argument; the default is 5%.

(defun lse (raw-data &optional (sample-size 0.05))
   "Return a least-squares error function built from a random sample
of RAW-DATA.  The closure maps a vector #(a b) to the halved mean
squared error of the line y = ax + b over the sample."
   (let* ((n (length raw-data))
          (m (floor (* n sample-size)))
          (subset (loop for z from 1 to m 
                        collect (aref raw-data (random n)))))
       (lambda (x) (/ (reduce '+ 
                              (map 'list 
                                   (lambda (i) (expt (- (+ (aref x 1) 
                                                           (* (aref x 0) 
                                                              (aref i 0))) 
                                                        (aref i 1)) 
                                               2)) 
                                   subset)) 
                      (* 2 m)))))
LSE
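
As a quick sanity check (my addition), the error at the true slope and intercept should be small, roughly half the mean square of the uniform noise on \([-0.4,0.4)\):

(funcall (lse mydata) (vector a b)) ; should be small, on the order of 0.03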

Here is our error function, followed by the calculation of the best fitting line:

(defvar myerr (lse mydata 0.08))

(let ((result (descent myerr #(-2.2 -1.2))))
  (format nil "~%    We found the best fitting line at a=~3,3F b=~3,3F
The original values were: a=~3,3F and b=~3,3F~%" 
              (aref result 0) 
              (aref result 1) 
              a 
              b)) 

We found the best fitting line at a=-2.012 b=2.121
The original values were: a=-1.970 and b=2.119

The method can easily be modified to accommodate datasets of higher dimensions. We just need to rewrite the lse function for that purpose.
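
Here is a rough sketch of such a rewrite (my own sketch, not from the original post), assuming each data point is stored as a vector #(x1 ... xk y) whose last coordinate is the dependent variable; the parameter vector then holds the k coefficients followed by the intercept:

(defun lse-nd (raw-data &optional (sample-size 0.05))
   "Like LSE, but each point of RAW-DATA is a vector #(x1 ... xk y).
The returned closure maps #(a1 ... ak b) to the halved mean squared
error of the hyperplane y = a1 x1 + ... + ak xk + b over a sample."
   (let* ((n (length raw-data))
          (m (floor (* n sample-size)))
          (k (1- (length (aref raw-data 0))))
          (subset (loop for z from 1 to m
                        collect (aref raw-data (random n)))))
      (lambda (params)
         (/ (reduce '+
                    (map 'list
                         (lambda (p)
                            ;; prediction = intercept + sum of coefficient * feature
                            (let ((prediction
                                    (+ (aref params k)
                                       (loop for j from 0 below k
                                             sum (* (aref params j)
                                                    (aref p j))))))
                               (expt (- prediction (aref p k)) 2)))
                         subset))
            (* 2 m)))))

With a hypothetical three-column dataset mydata3d of #(x1 x2 y) points, (descent (lse-nd mydata3d) #(0.0 0.0 0.0)) would then fit a plane the same way descent fit the line above.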