The Kitchen Sink and Other Oddities

Atabey Kaygun

An implementation of the Viterbi algorithm in Common Lisp

The setup

Assume we have a discrete sequence of observations \(x_1,\ldots,x_n\) where each observation can be assigned a label from a finite set \(y_1,\ldots,y_m\). Assume also that we know the probability distribution \(p(y)\) for the labels, and the probability \(p(x\|y)\) of seeing an observation \(x\) under the label \(y\). Finally, assume that we also know the transition probabilities: the probability \(p(y'\|y)\) of seeing the label \(y'\) right after the label \(y\).

The problem

Find an assignment of labels \(\ell\colon \{1,\ldots,n\}\to \{1,\ldots,m\}\) such that the sequence \[ y_{\ell(1)},\ldots,y_{\ell(n)} \] is the most likely among all such sequences.

The theoretical solution

From Bayes’ Theorem we see that \[ p(y\|x) = \frac{p(x\|y)p(y)}{p(x)} \] This means, given a labeling scheme \(\ell\colon \{1,\ldots,n\}\to \{1,\ldots,m\}\), the probability of the sequence of labels is \[ \pi(\ell) = \prod_{i=1}^n p(y_{\ell(i)}\|x_i) p(y_{\ell(i)}\|y_{\ell(i-1)}) \] where we use the convention that \(p(y_{\ell(1)}\|y_{\ell(0)}) = 1\). After applying Bayes’ Theorem we get \[ \pi(\ell) = \prod_{i=1}^n \frac{p(x_i\|y_{\ell(i)})p(y_{\ell(i)})}{p(x_i)} p(y_{\ell(i)}\|y_{\ell(i-1)}) \] Since \(p(x_i)\) does not depend on \(\ell\), we can drop it and obtain \[ \pi(\ell) \sim \prod_{i=1}^n p(x_i\|y_{\ell(i)}) p(y_{\ell(i)}) p(y_{\ell(i)}\|y_{\ell(i-1)}) \] Now, we need to maximize this quantity by choosing a suitable \(\ell\).
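
Since the logarithm is monotone, maximizing this quantity is the same as minimizing its negative logarithm, which turns the product into a sum: \[ -\log \prod_{i=1}^n p(x_i\|y_{\ell(i)}) p(y_{\ell(i)}) p(y_{\ell(i)}\|y_{\ell(i-1)}) = \sum_{i=1}^n \Big( -\log p(x_i\|y_{\ell(i)}) - \log p(y_{\ell(i)}) - \log p(y_{\ell(i)}\|y_{\ell(i-1)}) \Big) \] This sum of negative logarithms is exactly the score accumulated by the implementation below.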

Recursion with no memory

If we consider the problem one step at a time (which, as we will see, does not give the correct answer), the problem recursively reduces to the following:

Given \(\ell(i-1)\), find \(\ell(i)\in \{1,\ldots,m\}\) such that the quantity \[ p(x_i\|y_{\ell(i)}) p(y_{\ell(i)}) p(y_{\ell(i)}\|y_{\ell(i-1)}) \] is maximal.

For now, let us run with this. The recursive reduction tells us that if we define a function \(a(k,j,i)\) where \[ a(k,j,i) = p(x_j\|y_k) p(y_k) p(y_k\|y_i) \] then \[ \ell(i) = \text{argmax}_k a(k,i,\ell(i-1)) \]
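
As a minimal sketch, assuming the array conventions of the implementation below (a vector p-init with \(p(y_k)\) at index \(k\), a matrix p-emit holding \(p(x_j\|y_k)\) at row \(k\) and column \(j\), a matrix p-tran holding \(p(y_k\|y_i)\) at row \(i\) and column \(k\), and \(-1\) standing for "no previous label"), the function \(a\) could be written as follows; the name and argument order are only for illustration.

(defun a (k j i p-init p-tran p-emit)
  ;; a(k,j,i) = p(x_j|y_k) p(y_k) p(y_k|y_i); the transition factor is
  ;; taken to be 1 when there is no previous label (i = -1).
  (* (aref p-emit k j)
     (aref p-init k)
     (if (> i -1) (aref p-tran i k) 1.0)))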

An implementation

I will assume that the probabilities are given as arrays, with one or two indices depending on whether we have an ordinary probability distribution or a conditional one. We then define:

(defun viterbi (obs init-state p-init p-tran p-emit)
  (let ((m (array-dimension p-emit 0)))
    (labels
        ((next (j i)
           ;; Pick the most probable (label . probability) pair for the
           ;; observation j given the previous label i; a previous label
           ;; of -1 means there is no transition factor yet.
           (car (sort (loop for k from 0 below m collect
                            (cons k (* (aref p-init k)
                                       (aref p-emit k j)
                                       (if (> i -1)
                                           (aref p-tran i k)
                                           1.0))))
                      (lambda (x y) (> (cdr x) (cdr y))))))
         (iter (obs res sco)
           ;; Walk through the observations, accumulating the chosen labels
           ;; in res and the negative log-probabilities in sco.
           (if (null obs)
               (cons (cdr (reverse res)) sco)
               (let ((x (next (car obs) (car res))))
                 (iter (cdr obs)
                       (cons (car x) res)
                       (+ sco (- (log (cdr x)))))))))
      (iter obs (list init-state) 0))))
VITERBI

In the implementation I took the negative logarithms of the probabilities and summed them, instead of multiplying the probabilities themselves, in order to prevent underflow. This means a smaller score corresponds to a more likely labeling.
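
As a toy illustration of the issue, assuming nothing beyond standard double floats: multiplying a few hundred probabilities of \(0.1\) directly runs out of floating-point range, while the sum of their negative logarithms stays perfectly manageable.

(loop repeat 400
      for p = 0.1d0 then (* p 0.1d0)
      finally (return p))
;; underflows to 0.0d0 on most implementations
;; (some signal a floating-point-underflow condition instead)

(loop repeat 400
      sum (- (log 0.1d0)))
;; ≈ 921.034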

Examples

Consider the following example taken from the Viterbi Algorithm page on Wikipedia.

(let ((p-init  #(0.6 0.4))
      (p-tran #2A((0.7 0.3) (0.4 0.6)))
      (p-emit #2A((0.1 0.4 0.5)
                  (0.6 0.3 0.1))))
  (viterbi '(2 1 0) -1 p-init p-tran p-emit))
((0 0 1) . 5.618853263870896)
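
Since the returned score is a sum of negative logarithms, exponentiating its negation recovers the value of the product being maximized, that is, the right-hand side of the \(\pi(\ell)\) relation above evaluated along the chosen labels \((0\ 0\ 1)\):

(exp (- 5.618853263870896))
;; ≈ 0.0036288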

The following example is a modified version of the example from a Nature paper.

(let ((p-init  #(0.3 0.3 0.3))
      (p-tran #2A((0.75 0.25 0.00)
                  (0.00 0.00 1.00)
                  (0.00 0.00 1.00)))
      (p-emit #2A((0.25 0.25 0.25 0.25)
                  (0.15 0.00 0.85 0.00)
                  (0.40 0.10 0.10 0.40)))
      (seq (mapcar (lambda (x) (position x '(:a :c :g :t)))
                  '(:c :t :t :c
                    :a :t :g :t
                    :g :a :a :a
                    :g :c :a :g
                    :a :c :g :t
                    :a :a :g :t
                    :c :a :g :t))))
  (viterbi seq -1 p-init p-tran p-emit))
((0 0 0 0 0 0 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2) . 76.73498296015836)

What is next?

The recursive reduction above looks only at the previous step, and such a greedy choice does not, in general, find the most likely sequence of labels: the full Viterbi algorithm keeps the best partial score for every possible label at each step and commits to a labeling only at the end. That is why the result I obtained above for the genomic sequence differs from the one given in the paper I mentioned above. Also, the transition probability matrix for the labels needs to be estimated carefully for the problem at hand. One can use the Baum-Welch algorithm to do just that.
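
A minimal sketch of that dynamic programming version, reusing the array conventions and the negative-log scores from above, might look like the following. The names viterbi-dp, nlog and add-cost, and the choice of a large finite cost standing in for probability zero, are only for illustration, and the per-step factor is kept identical to the greedy version so that the scores are comparable.

(defconstant +inf-cost+ 1d300)

(defun nlog (p)
  ;; Negative logarithm, with probability zero mapped to a huge (but finite)
  ;; cost so that (log 0) is never evaluated.
  (if (zerop p)
      +inf-cost+
      (- (log (coerce p 'double-float)))))

(defun add-cost (a b)
  ;; Add two negative-log costs, clamping at the "impossible" cost.
  (min +inf-cost+ (+ a b)))

(defun viterbi-dp (obs init-state p-init p-tran p-emit)
  (let* ((n (length obs))
         (m (array-dimension p-emit 0))
         (x (coerce obs 'vector))
         ;; cost(t, k): smallest negative-log score of any labeling of the
         ;; first t+1 observations that ends with label k.
         (cost (make-array (list n m) :initial-element +inf-cost+))
         ;; back(t, k): the label at position t-1 on that best labeling.
         (back (make-array (list n m) :initial-element -1)))
    (flet ((step-cost (k j i)
             ;; the same per-step factor p(y_k) p(x_j|y_k) p(y_k|y_i) as in
             ;; the greedy version; no transition factor when i is -1
             (add-cost (add-cost (nlog (aref p-init k))
                                 (nlog (aref p-emit k j)))
                       (if (> i -1) (nlog (aref p-tran i k)) 0d0))))
      ;; first observation
      (dotimes (k m)
        (setf (aref cost 0 k) (step-cost k (aref x 0) init-state)))
      ;; remaining observations: try every previous label for every label
      (loop for tt from 1 below n do
        (dotimes (k m)
          (dotimes (i m)
            (let ((c (add-cost (aref cost (1- tt) i)
                               (step-cost k (aref x tt) i))))
              (when (< c (aref cost tt k))
                (setf (aref cost tt k) c
                      (aref back tt k) i))))))
      ;; pick the best final label, then follow the back-pointers home
      (let ((best 0))
        (dotimes (k m)
          (when (< (aref cost (1- n) k) (aref cost (1- n) best))
            (setf best k)))
        (let ((score (aref cost (1- n) best))
              (path (list best)))
          (loop for tt downfrom (1- n) above 0
                do (setf best (aref back tt best))
                   (push best path))
          (cons path score))))))

Replacing viterbi with viterbi-dp in the examples above calls it the same way and returns a (labels . score) pair of the same shape.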