The Kitchen Sink and Other Oddities

Atabey Kaygun

Sampling from a Random Variable

Description of the problem

A random variable is essentially a probability density function (PDF) \(p\colon \Omega\to [0,\infty)\) (or sometimes just referred as a probability measure or as a (probability) distribution) defined on a measurable space. The notation for such a choice is \(X\sim p\) if \(X\) is a random variable taking values in a measurable space \(\Omega\) with a PDF \(p\). Today, I’ll keep things simple and work with random variables which take values in the set of real numbers. These are usually referred as continuous random variables.

Here is the problem I am going to tackle today: We would like to take a finite sample of numbers \(x_1,\ldots,x_N\) in such a way that the number of samples \(N_{a,b}\) in an interval \([a,b]\) is roughly \(N\) times the probability of the random variable \(X\) being in \([a,b]\). This number can be written as an integral \[ N_{a,b} \approx N \int_a^b p(x) dx \]

Uniform random sampling

Simulating random numbers, or random sampling from a finite collection of elements, on a computer is a tricky business. What I am going to do today assumes that I have a good uniform random sampling simulation.

The uniform PDF is a density function that looks the same at every point. So, I want to have a uniform random variable on the interval \([a,b]\) my PDF should be \(p(x) = 0\) when \(x\) is outside of the interval \([a,b]\), and \(p(x)\) must be the constant \(\frac{1}{b-a}\) when \(x\) is in \([a,b]\). The choice of the constant is dictated by the fact that \(p(x)\) should integrate to 1 on the whole space. For sampling simulation from a uniform random variable in \([0,1]\) I am going to use lisp’s random function:

(loop repeat 10 collect (random 1.0))
(0.24052 0.6231128 0.20052421 0.74887836 0.345209 0.107444644 0.18069339
 0.8515891 0.5105244 0.76619613)

Sampling from a specific random variable.

OK. Assuming we have a good uniform sampling simulation, how can we simulate sampling from a random variable \(X\sim p\)?

Now, each PDF \(p(x)\) has a cumulative distribution function (CDF) defined via an integral \[ c(x) = \int_{-\infty}^x p(t)dt \] Since PDF function \(p(t)\) is positive, the function defined by the integral is non-decreasing which means the CDF function \(c(x)\) is invertible almost everywhere if \(p(x)\) is nice enough. Moreover, if \(x_1,\ldots,x_N\) is a sample from the uniform distribution on \([0,1]\), the numbers \[ c^{-1}(x_1),\ldots,c^{-1}(x_N) \] is a sample from \(X\sim p\).

Approximating CDF

Now, let us assume we have the PDF \(p(x)\) on a fixed finite interval \([a,b]\), and let us subdivide the interval into equal pieces of length \(\epsilon>0\) \[ x_i = a + i\cdot \epsilon \text{ where } i=0,\ldots,\lfloor (b-a)/\epsilon \rfloor \] and consider the numbers \(p(x_1),\ldots,p(x_N)\). We can approximate the CDF via a Riemann sum

\[ c(x) \approx \epsilon\sum_{i=0}^{\lfloor x/\epsilon \rfloor} p(x_i) \]

I can do that in lisp. First, I’ll need a reduce function which returns the partial applications.

(defun reduce-with-intermediates (fn xs &optional carry)
  (if (null xs)
      (nreverse carry)
      (let ((res (if (null carry)
                     (funcall fn (car xs))
                     (funcall fn (car xs) (car carry)))))
        (reduce-with-intermediates fn (cdr xs) (cons res carry)))))
REDUCE-WITH-INTERMEDIATES

Let us test. I am going to convert the sine function into PDF

(defparameter epsilon 0.0001)

(defparameter sine-cdf-xs 
   (loop for x from 0 to 1.0 by epsilon collect x))

(defparameter sine-cdf-ys
   (reduce-with-intermediates
           #'+
           (mapcar (lambda (x) (* epsilon (/ pi 2.0) (sin (* pi x)))) 
                   sine-cdf-xs)))
EPSILON
SINE-CDF-XS
SINE-CDF-YS

I did have to modify the sine function by a suitable constant to make this look like a CDF.

So, we get a sample of outputs from sine-cdf function. But, we don’t want to approximate sine-cdf, we want the inverse of it. Before we dive into it, let us first discuss interpolation.

Interpolation

Given a list of pairs \((x_1,y_1),\ldots,(x_N,y_N)\) if we want a function \(f(x)\) that interpolates these points we would need the interval \(i\) with \(x\in [x_i,x_{i+1}]\) and then we would write \[ f(x) = \frac{x_{i+1}-x}{x_{i+1}-x_i} y_i + \frac{x-x_i}{x_{i+1}-x_i} y_{i+1} \] if we wanted to linearly interpolate. However, I am going to use a much simpler scheme: I’ll use the left-end-point \[ f(x) = y_i \]

Let us implement that:

(defun middle (a b)
   (let ((c (+ a b)))
      (if (evenp c)
          (/ c 2)
          (/ (1- c) 2))))

(defun place (x xs &optional (a 0) (b (length xs)))
   (let ((c (middle a b)))
      (cond ((<= (- b a) 1) a)
            ((>= x (elt xs c)) (place x xs c b))
            (t (place x xs a c)))))

(defun interpolate (xs ys)
   (lambda (x) (elt ys (place x xs))))
MIDDLE
PLACE
INTERPOLATE

Let us test on the sine function

(let* ((xs (sort (loop repeat 500 collect (random pi)) #'<))
       (ys (mapcar #'sin xs))
       (approx (interpolate xs ys))
       (zs (loop repeat 20 collect (random pi))))
  (format nil "~{~{~2,4f ~2,4f~%~}~}" (mapcar #'list (mapcar #'sin zs) (mapcar approx zs))))
.9527 .9526
.7193 .7226
.3221 .3220
.9819 .9814
.2509 .2636
.2576 .2636
.9685 .9691
.2978 .3008
.8657 .8691
.3628 .3625
.9984 .9982
.0427 .0367
.9509 .9503
.4417 .4583
.8723 .8688
.9999 1.0000
.7360 .7405
.3575 .3626
.1077 .1046
.8791 .8792

Interpolation of inverse functions

For inverse functions, we do something very similar: we find the interval number \(i\) that satisfies \(x\in [f(x_i),f(x_{i+1})]\) then let \(f^{-1}(x)\) be an interpolation of \(x_i\) and \(x_{i+1}\): \[ f^{-1}(x) = \frac{f(x_{i+1})-x}{f(x_{i+1})-f(x_i)}x_i + \frac{x-f(x_i)}{f(x_{i+1})-f(x_i)}x_{i+1} \] However, in our case this would be the left-end-point: \[ f^{-1}(x) = x_i \]

In other words, in both cases, we interpolate the inverse function by reversing \((x_i,y_i)\)’s to \((y_i,x_i)\)’s.

Let us test on the sine function again. But remember that sine is invertible only on \([0,\pi/2]\):

(let* ((xs (sort (loop repeat 500 collect (random (/ pi 2))) #'<))
       (ys (mapcar #'sin xs))
       (approx (interpolate ys xs))
       (zs (sort (loop repeat 20 collect (random 1.0d0)) #'<)))
  (format nil "~{~{~2,4f ~2,4f~%~}~}" (mapcar #'list (mapcar #'asin zs) (mapcar approx zs))))
.0297 .0280
.0479 .0451
.0861 .0777
.1197 .1144
.1353 .1352
.2970 .2968
.3188 .3186
.3216 .3186
.4984 .4926
.7063 .7009
.7110 .7068
.8620 .8552
.8935 .8909
.9680 .9629
1.0176 1.0126
1.1661 1.1652
1.2155 1.2144
1.2319 1.2256
1.2635 1.2632
1.3829 1.3829

Approximating the inverse of CDF

Let us test the inversion on the sine-cdf:

(let* ((sine-cdf (interpolate sine-cdf-xs sine-cdf-ys))
       (sine-cdf-inverse (interpolate sine-cdf-ys sine-cdf-xs))
       (xs (sort (loop repeat 20 collect (random 1.0d0)) #'<))
       (ys (mapcar sine-cdf xs))
       (zs (mapcar #'list xs (mapcar sine-cdf-inverse ys))))
  (format nil "~{~{~2,4f ~2,4f~%~}~}" zs))
.0888 .0888
.0900 .0900
.1540 .1539
.2175 .2175
.2193 .2192
.2420 .2419
.2652 .2652
.4074 .4074
.4357 .4357
.5075 .5075
.5605 .5604
.6556 .6556
.6667 .6666
.6740 .6740
.7127 .7126
.7499 .7498
.7996 .7995
.8755 .8755
.9484 .9483
.9705 .9704

Random sampling from an arbitrary random variable

Given a PDF \(f(x)\), first I would need the CDF \(F(x)\) from \(f(x)\). In the absence of the function, we would need a dense enough sample \((x_1,p_1),\ldots,(x_N,p_N)\) of points and associated probabilities to write an interpolation of the CDF using partial sums \[ y_i = \sum_{j=1}^i p_j \] to get the pairs \((x_1,y_1),\ldots,(x_N,y_N)\) and then an interpolation \(F^{-1}(x)\) for the inverse function using the reverse pairs \((y_1,x_1),\ldots,(y_N,x_N)\). Then we take a good sample of uniform random numbers \(u_1,\ldots,u_m\) and evaluate \(F^{-1}(u_1),\ldots,F^{-1}(u_m)\).

For the running example we get

(let ((sine-cdf-inverse (interpolate sine-cdf-ys sine-cdf-xs)))
   (defun sample-sine-pdf (m)
       (loop repeat m collect (funcall sine-cdf-inverse (random 1.0)))))

(format nil "~{~2,4f~%~}" (sample-sine-pdf 10))
SAMPLE-SINE-PDF
.6104
.5235
.8408
.6770
.7273
.3837
.0471
.5568
.3339
.7586