Assume we have a function of the form \(f\colon \mathbb{R}^n\to \mathbb{R}\). We would like to find a local extremum point of this function iteratively, starting from an initial point. I will use the method of gradient descent. In its iterative form it says that the recursive sequence \[ x_{n+1} = x_n - \gamma \nabla f(x_n) \] converges to a local extremum point provided \(f\) is well-behaved and the initial choice \(x_0\) is close enough. Here, our parameters are the initial point \(x_0\) and an iteration parameter \(\gamma\).
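To see the role of \(\gamma\) concretely, take \(f(x)=x^2\), so that \(\nabla f(x) = 2x\). The iteration becomes \[ x_{n+1} = x_n - 2\gamma x_n = (1-2\gamma)x_n, \quad\text{hence}\quad x_n = (1-2\gamma)^n x_0, \] which converges to the minimum at \(0\) exactly when \(|1-2\gamma|<1\), that is, when \(0<\gamma<1\). Too large a \(\gamma\) overshoots and diverges, while too small a \(\gamma\) merely slows the convergence.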
Let me implement a toy example. Let us find a local extremum of a function \(f(x)\) of a single real variable \(x\). I don't have to work very hard for this case: I could write a simple lisp function which returns the numerical derivative \(f'(x)\) of the given function \(f(x)\) and then plug that derivative into the Newton-Raphson algorithm to find out where the local extrema are. But today I am going to follow a different path.
First, I need a function which calculates the derivative of a function (as a number) at a given point.
(defun diff (f x &optional (epsilon 1.0d-4))
  ;; Central difference approximation of f'(x).
  (/ (- (funcall f (+ x epsilon))
        (funcall f (- x epsilon)))
     (* 2 epsilon)))
DIFF
Let me test this function. We all know that the derivative of \(\sin(x)\) is \(\cos(x)\). I will compare these functions at a random point between \(0\) and \(\pi\):
(let ((x (random (* 4 (atan 1))))) (cons (cos x) (diff 'sin x)))
(-0.39555392 . -0.39555389353740367d0)
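As an aside, here is a minimal sketch of the Newton-Raphson route mentioned above; the name newton-extremum and the nested use of diff to approximate the second derivative are my own choices, not something from the text that follows. A critical point solves \(f'(x)=0\), so Newton's iteration is \(x \leftarrow x - f'(x)/f''(x)\):

(defun newton-extremum (f x &key (error 1.0d-8) (max-steps 100))
  ;; Newton-Raphson on f'(x) = 0; both f' and f'' are estimated with DIFF.
  (do* ((y x (- y step))
        (step (/ (diff f y) (diff (lambda (u) (diff f u)) y))
              (/ (diff f y) (diff (lambda (u) (diff f u)) y)))
        (n 0 (1+ n)))
       ((or (< (abs step) error) (>= n max-steps))
        (if (< n max-steps) y))))

On a quadratic such as \(f(x)=x(x-1)\), this should land essentially on \(0.5\) in a single step.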
Now, my implementation of the gradient descent for the toy case of a function of a single real variable:
(defun descent1 (fun x &key (error 1.0d-4) (rate 1.0d-2) (max-steps 10000))
  (do* ((y x (- y step))
        ;; STEP is the scaled numerical derivative at the current point.
        (step (* rate (diff fun y)) (* rate (diff fun y)))
        (n 0 (1+ n)))
       ((or (< (abs step) error) (>= n max-steps)) (if (< n max-steps) y))))
DESCENT1
I designed the function in such a way that if it can't find a solution within the first 10000 steps (or whatever number of steps is passed in via max-steps) it will return nil.
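As a quick illustration of that behavior (a call of my own, assuming the defaults above): with only a handful of steps allowed, the loop cannot shrink the step below the error threshold, so we get nil back.

(descent1 (lambda (x) (* x (1- x))) 1.1 :max-steps 3) ; => NIL: three steps are not enough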
The function \(f(x)=x(x-1)\) has a local minimum at \(x=0.5\). So, let me see if we can find this local extremum using the function descent1 we defined above:
(descent1 (lambda (x) (* x (1- x))) 1.1 :error 1.0d-6 :rate 3.5d-2)
0.5000139683370232d0
Now, let me move on to functions of several variables. First, I will need a function that takes a function \(f\), a vector input \(x\), and another vector \(v\) giving a direction, and calculates the directional derivative \(\nabla_v f|_x\) of \(f\) along \(v\) at \(x\).
(defun norm (x)
  ;; Euclidean norm of the vector X.
  (sqrt (reduce '+ (map 'vector (lambda (i) (expt i 2)) x))))

(defun nabla (f x v &optional (epsilon 1.0d-4))
  ;; Central difference approximation of the directional derivative
  ;; of F along V at X.
  (let* ((dir (map 'vector (lambda (u) (* epsilon u)) v))
         (ter (map 'vector '+ x dir))
         (ini (map 'vector '- x dir)))
    (/ (- (funcall f ter)
          (funcall f ini))
       (* 2 (norm dir)))))
NABLA
Let me take \(f(x,y) = x^2 + y^2\) and find its partial derivatives at \((0,0)\).
(defun f (v)
  (reduce '+ (map 'vector (lambda (i) (expt i 2)) v)))

(cons (nabla 'f #(0 0) #(0 1)) (nabla 'f #(0 0) #(1 0)))
(0.0d0 . 0.0d0)
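Both partial derivatives vanish at the origin, so as an extra sanity check of my own, here is the partial derivative in the \(x\)-direction at \((1,0)\), which should come out very close to \(2\):

(nabla 'f #(1 0) #(1 0)) ; => approximately 2.0d0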
Before implementing the gradient descent itself, I will need some utility functions:
(defun range (n &optional (ini 0) (step 1))
  ;; Note that the list includes both endpoints: (range 10) has 11 elements.
  (loop for i from ini to n by step collect i))

(defun basis (i n)
  (map 'vector (lambda (j) (if (equal i j) 1 0)) (range n)))
BASIS
Let me test these (since range includes both endpoints, we should see eleven norms):
(mapcar (lambda (i) (norm (basis i 10))) (range 10))
(1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0)
Now, the full implementation of the gradient descent.
(defun descent (fun x &key (error 1.0d-5)
                           (rate 1.0d-2)
                           (max-steps 10000))
  (let ((len (length x)))
    (do* ((y x (map 'vector '- y step))
          ;; STEP collects the scaled partial derivatives along each
          ;; basis direction.
          (step (map 'vector
                     (lambda (i) (* rate (nabla fun y (basis i len))))
                     (range len))
                (map 'vector
                     (lambda (i) (* rate (nabla fun y (basis i len))))
                     (range len)))
          (n 0 (1+ n)))
         ((or (< (norm step) error)
              (>= n max-steps))
          (if (< n max-steps) y)))))
DESCENT
We defined \(f(x,y) = x^2 + y^2\) above. Let me test the function descent on it:
(let ((result (descent 'f #(1.0 1.0) :rate 3.3d-2)))
  (format nil "~% We found a local extremum at x=~3,3F y=~3,3F~%"
          (aref result 0)
          (aref result 1)))
We found a local extremum at x=.000 y=.000
Let me test a more complicated function: \(g(x,y,z) = \sin(x/2)\cos(y) + z^2\).
(defun g (x)
  (+ (* (sin (* 0.5 (aref x 0))) (cos (aref x 1)))
     (expt (aref x 2) 2)))
(let ((result (descent 'g #(-1.2 -0.2 0.9) :rate 2.3d-2 :error 1.d-6)))
  (format nil "~% We found a local extremum at x=~3,3F y=~3,3F z=~3,3F~%"
          (aref result 0)
          (aref result 1)
          (aref result 2)))
We found a local extremum at x=-3.141 y=-.000 z=.000
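This is consistent: at \((-\pi, 0, 0)\) we have \(g = \sin(-\pi/2)\cos(0) + 0^2 = -1\), a local minimum. A quick check of my own confirms the value:

(g (vector (- pi) 0.0 0.0)) ; => approximately -1.0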
For simplicity, assume we have a collection of points \((x_1,y_1),\ldots,(x_n,y_n)\) on the xy-plane and we want to fit an approximating line to this dataset. We will write an error function \[ E(a,b) = \frac{1}{2n}\sum_{i=1}^n (a x_i + b - y_i)^2 \] and then minimize this function using gradient descent.
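For reference, the partial derivatives that the numerical machinery above will approximate have simple closed forms here: \[ \frac{\partial E}{\partial a} = \frac{1}{n}\sum_{i=1}^n (a x_i + b - y_i)x_i, \qquad \frac{\partial E}{\partial b} = \frac{1}{n}\sum_{i=1}^n (a x_i + b - y_i). \] Setting both to zero would give the usual least-squares normal equations; gradient descent approaches the same solution iteratively instead.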
I need a suitably random dataset.
(defvar a (- (random 8.6) 4.3))
(defvar b (- (random 6.8) 3.4))
(defvar mydata
  (map 'vector
       (lambda (x) (coerce (list x (+ b (* a x) (- (random 0.8) 0.4)))
                           'vector))
       (loop for i from 1 to 520 collect (random 2.0))))
MYDATA
First, I chose a random slope and a random intercept, then I generated a set of 520 points from the interval \([0,2)\) and calculated \(ax+b\) plus a small error term. Now, let me create the error function from this dataset. For this purpose, I will write a lisp function that returns my error function for a given dataset. In order to construct the error function, I will take only a sample of the dataset. The sample percentage can be passed as an argument; the default is 5%.
(defun lse (raw-data &optional (sample-size 0.05))
  (let* ((n (length raw-data))
         (m (floor (* n sample-size)))
         ;; Draw M points from the data (with replacement) to build
         ;; the error function.
         (subset (loop for z from 1 to m
                       collect (aref raw-data (random n)))))
    (lambda (x) (/ (reduce '+
                           (map 'list
                                (lambda (i) (expt (+ (aref x 1)
                                                     (* (aref x 0)
                                                        (aref i 0))
                                                     (- 0 (aref i 1)))
                                                  2))
                                subset))
                   (* 2 m)))))
LSE
Here is our error function, followed by our calculation of the best fitting line:
(defvar myerr (lse mydata 0.08))

(let ((result (descent myerr #(-2.2 -1.2))))
  (format nil "~% We found the best fitting line at a=~3,3F b=~3,3F
The original values were: a=~3,3F and b=~3,3F~%"
          (aref result 0)
          (aref result 1)
          a
          b))
We found the best fitting line at a=-2.012 b=2.121
The original values were: a=-1.970 and b=2.119
The method can easily be modified to accommodate datasets of higher dimensions. We just need to rewrite the lse function for that purpose.
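To make that concrete, here is a hedged sketch of such a rewrite; the name lse-nd and the data layout are my own assumptions, not part of the code above. Each data point is now a vector #(x1 ... xk y), and the parameter vector passed to the returned closure is #(a1 ... ak b) for the hyperplane \(y = a_1 x_1 + \cdots + a_k x_k + b\).

(defun lse-nd (raw-data &optional (sample-size 0.05))
  ;; A sketch only: generalizes LSE to points #(x1 ... xk y) with
  ;; parameters #(a1 ... ak b).
  (let* ((n (length raw-data))
         (m (floor (* n sample-size)))
         (k (1- (length (aref raw-data 0)))) ; number of input coordinates
         (subset (loop for z from 1 to m
                       collect (aref raw-data (random n)))))
    (lambda (params)
      (/ (reduce '+
                 (map 'list
                      (lambda (p)
                        (expt (- (+ (aref params k) ; the intercept b
                                    (loop for j from 0 below k
                                          sum (* (aref params j) (aref p j))))
                                 (aref p k))        ; the observed y
                              2))
                      subset))
         (* 2 m)))))

The resulting closure can be fed to descent exactly as before, with an initial parameter vector of length \(k+1\).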