Assume we have a discrete sequence of observations \(x_1,\ldots,x_n\) where each observation can be assigned a label from a finite set \(y_1,\ldots,y_m\). Assume also that we know the probability distribution \(p(y)\) for the labels, and the probability \(p(x\|y)\) of seeing an observation \(x\) under the label \(y\). Finally, assume that we also have transition probabilities: the probability \(p(y'\|y)\) of seeing the label \(y'\) right after the label \(y\).
Find an assignment of labels \(\ell\colon \{1,\ldots,n\}\to \{1,\ldots,m\}\) such that the sequence \[ y_{\ell(1)},\ldots,y_{\ell(n)} \] is the most likely among all such sequences.
From Bayes’ Theorem we see that \[ p(y\|x) = \frac{p(x\|y)p(y)}{p(x)} \] This means that, given a labeling scheme \(\ell\colon \{1,\ldots,n\}\to \{1,\ldots,m\}\), the probability of the sequence of labels is \[ \pi(\ell) = \prod_{i=1}^n p(y_{\ell(i)}\|x_i) \prod_{i=1}^{n-1} p(y_{\ell(i+1)}\|y_{\ell(i)}) \] After applying Bayes’ Theorem we get \[ \pi(\ell) = \prod_{i=1}^n \frac{p(x_i\|y_{\ell(i)})p(y_{\ell(i)})}{p(x_i)} \prod_{i=1}^{n-1} p(y_{\ell(i+1)}\|y_{\ell(i)}) \] Since \(p(x_i)\) does not depend on \(\ell\), we can drop it and get \[ \pi(\ell) \sim \prod_{i=1}^n p(x_i\|y_{\ell(i)})\, p(y_{\ell(i)})\, p(y_{\ell(i)}\|y_{\ell(i-1)}) \] Here, we use the convention that \(p(y_{\ell(1)}\|y_{\ell(0)}) = 1\). Now, we need to maximize this quantity by choosing a suitable \(\ell\).
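Since the logarithm is monotone, maximizing \(\pi(\ell)\) is the same as minimizing the sum of the negative logarithms of the per-step factors, which is also numerically safer: \[ -\log \pi(\ell) = -\sum_{i=1}^n \left( \log p(x_i\|y_{\ell(i)}) + \log p(y_{\ell(i)}) + \log p(y_{\ell(i)}\|y_{\ell(i-1)}) \right) + C \] where \(C = \sum_{i=1}^n \log p(x_i)\) does not depend on \(\ell\). This is the quantity the implementation below accumulates as its score.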
If we consider the problem one step at a time (a greedy simplification which, as we will see, does not give the correct answer), it recursively reduces to:
Given \(\ell(i-1)\), find \(\ell(i)\in \{1,\ldots,m\}\) such that the quantity \[ p(x_i\|y_{\ell(i)}) p(y_{\ell(i)}) p(y_{\ell(i)}\|y_{\ell(i-1)}) \] is maximal.
For now, let us run with this. The recursive reduction tells us that if we define a function \(a(k,j,i)\) where \[ a(k,j,i) = p(x_j\|y_k) p(y_k) p(y_k\|y_i) \] then \[ \ell(i) = \text{argmax}_k a(k,i,\ell(i-1)) \]
I will assume that the probabilities are given as arrays, with one or two indices depending on whether we have an ordinary probability distribution or a conditional one. And we define
(defun viterbi (obs init-state p-init p-tran p-emit)
  (let ((m (array-dimension p-emit 0)))
    (labels
        ;; Given the observation J and the previous label I, greedily pick
        ;; the label K maximizing p(x_j|y_k) p(y_k) p(y_k|y_i); an I of -1
        ;; means there is no previous label, so the transition factor is 1.
        ((next (j i)
           (car (sort
                 (loop for k from 0 below m collect
                       (cons k (* (aref p-init k)
                                  (aref p-emit k j)
                                  (if (> i -1)
                                      (aref p-tran i k)
                                      1.0))))
                 #'> :key #'cdr)))
         ;; Walk the observations, accumulating the chosen labels in RES
         ;; and the sum of negative log-probabilities in SCO.
         (iter (obs res sco)
           (if (null obs)
               (cons (cdr (reverse res)) sco)
               (let ((x (next (car obs) (car res))))
                 (iter (cdr obs)
                       (cons (car x) res)
                       (- sco (log (cdr x))))))))
      (iter obs (list init-state) 0))))
VITERBI
In the implementation I accumulated the score as the sum of the negative logarithms of the per-step probabilities to prevent underflow.
Consider the following example, taken from the Viterbi algorithm page on Wikipedia.
(let ((p-init #(0.6 0.4))
(p-tran #2A((0.7 0.3) (0.4 0.6)))
(p-emit #2A((0.1 0.4 0.5)
(0.6 0.3 0.1))))
(viterbi '(2 1 0) -1 p-init p-tran p-emit))
((0 0 1) . 5.618853263870896)
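To read the labels back in terms of the Wikipedia example (where, judging by the matrices, rows 0 and 1 are the Healthy and Fever states and the observation indices 2, 1, 0 encode normal, cold, dizzy), one can translate the indices; the keyword names here are mine:
(mapcar (lambda (k) (nth k '(:healthy :fever))) '(0 0 1))
(:HEALTHY :HEALTHY :FEVER)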
The following example is a modified version of the example from a Nature paper.
(let ((p-init #(0.3 0.3 0.3))
(p-tran #2A((0.75 0.25 0.00)
(0.00 0.00 1.00)
(0.00 0.00 1.00)))
(p-emit #2A((0.25 0.25 0.25 0.25)
(0.15 0.00 0.85 0.00)
(0.40 0.10 0.10 0.40)))
(seq (mapcar (lambda (x) (position x '(:a :c :g :t)))
'(:c :t :t :c
:a :t :g :t
:g :a :a :a
:g :c :a :g
:a :c :g :t
:a :a :g :t
:c :a :g :t))))
(viterbi seq -1 p-init p-tran p-emit))
((0 0 0 0 0 0 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2) . 76.73498296015836)
The greedy recursive reduction looks only at the previous step, which is not the right way to find the globally most likely sequence: the full Viterbi algorithm instead keeps, for every label, the score of the best path ending in that label, and chooses among those paths only at the end. That is why the result I obtained above for the genomic sequence differs from the original given in the paper I mentioned above. Also, the matrix of transition probabilities for the labels needs to be constructed carefully for the problem at hand; one can use the Baum-Welch Algorithm to do just that.
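For comparison, here is a minimal sketch of the standard dynamic-programming recursion, under the same array conventions as above. The name viterbi-dp and the code structure are mine, and note that, unlike the per-step prior in the derivation above, the standard formulation uses the initial distribution only at the first observation. It also works with raw probabilities; for long sequences one would switch to summed negative logarithms as before.
(defun viterbi-dp (obs p-init p-tran p-emit)
  (let* ((m (array-dimension p-emit 0))
         (x (coerce obs 'vector))
         (n (length x))
         ;; score[i][k]: probability of the best path ending in label K
         ;; after the first I+1 observations; back[i][k]: its predecessor.
         (score (make-array (list n m)))
         (back (make-array (list n m))))
    ;; Initialization: the initial distribution enters only here.
    (dotimes (k m)
      (setf (aref score 0 k)
            (* (aref p-init k) (aref p-emit k (aref x 0)))))
    ;; Recursion: best predecessor J for each label K at each step I.
    (loop for i from 1 below n do
      (dotimes (k m)
        (loop with best = -1.0 and arg = 0
              for j from 0 below m
              for s = (* (aref score (1- i) j) (aref p-tran j k))
              when (> s best) do (setf best s arg j)
              finally (setf (aref score i k)
                            (* best (aref p-emit k (aref x i)))
                            (aref back i k) arg))))
    ;; Termination: pick the best final label, then trace the pointers back.
    (let* ((final (loop with arg = 0
                        for k from 1 below m
                        when (> (aref score (1- n) k) (aref score (1- n) arg))
                          do (setf arg k)
                        finally (return arg)))
           (path (list final)))
      (loop for i from (1- n) downto 1 do
        (push (aref back i (first path)) path))
      (cons path (aref score (1- n) final)))))
On the Wikipedia example above, (viterbi-dp '(2 1 0) p-init p-tran p-emit) should return ((0 0 1) . 0.01512) (up to floating-point rounding), the same path as the greedy version, with the path probability reported in the walkthrough on that page.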