The Kitchen Sink and Other Oddities

Atabey Kaygun

Patterns Separating Large Texts

Description of the problem

The other day, I was looking at a problem of pattern matching. Specifically, I was thinking about which parts of a pattern are the most significant when trying to match it against a large set of strings. The solution I gave used entropy, much the same way decision trees use entropy, except that instead of minimizing it I chose the maximization route.

In that problem I had one luxury: the strings (both the pattern string and the strings in the set) had a beginning and an end. In cases like genome sequences this is one luxury we don’t have. Plus, there are the problems of mutations, insertions and deletions. An exact match is usually not the thing one wants. I guess the success of BLAST- or BITAP-type algorithms and their variations comes from the fact that these are fast approximate matching algorithms.1

So, today I will consider another matching problem: if I were to find a specific pattern (preferably large) in a large set of large strings, which part of the pattern should I choose to match against the large set?

Given a large pattern and an even larger string, or set of strings, to match against, which part, or parts, of the pattern should I choose to maximize my chances of a total match?

Let me make this clear: this is not an alternative to fast approximate search algorithms such as BLAST, but rather a complement. It helps us choose a specific part of the pattern to match against the large set of strings.

The delimiters

Assume we have an alphabet consisting of letters \(a_0,a_1,\ldots,a_n\) and a language which uses this alphabet to form its vocabulary and sentences. Assume also that the language uses \(a_0\) as the delimiter marking the beginning and the end of each word. What is the expected length of a word in this language? This is a nice small problem I need to understand before I embark on solving the larger one.

Each language (human or otherwise) has a specific distribution of characters (assuming the languages we compare use the same set of characters). Let \(p_0,p_1,\ldots,p_n\) be the probabilities of seeing the characters \(a_0, a_1,\ldots,a_n\), respectively. Then the expected length is given by an infinite series: \[ E(\ell) = \sum_{\ell=0}^\infty \ell (1-p_0)^\ell p_0 \] The sum can be obtained from the power series \[ x(1-x) \frac{d}{dx}\sum_{\ell=0}^\infty x^\ell = x(1-x)\frac{d}{dx}\frac{1}{1-x} = \frac{x}{1-x} \] by substituting \(1-p_0\) for \(x\). Thus \[ E(\ell) = \frac{1-p_0}{p_0} = \left(\frac{1}{p_0}-1\right) \]
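Before moving on, here is a quick sanity check of this formula (a minimal sketch, not part of the original derivation): draw characters one at a time, treating a delimiter as appearing with probability \(p_0\), and average the lengths of the runs in between.

(defun simulate-word-length (p0 &optional (trials 100000))
   "Monte Carlo estimate of the expected word length when the
delimiter appears with probability P0 at each position."
   (/ (loop repeat trials
            ;; count the non-delimiter draws before a delimiter shows up
            sum (loop until (< (random 1.0d0) p0)
                      count t))
      trials 1.0d0))

For \(p_0 = 0.198\) this returns values around \(4.05\), in agreement with \(1/0.198 - 1\).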

Experiment with Mark Twain

First, let me do a small experiment with an English text.

Below, I will find the character distributions from the texts “Adventures of Tom Sawyer” and “Adventures of Huckleberry Finn” by Mark Twain. The first function processes the text file:

(defun character-distribution (file)
   "Return a hash table counting each (downcased) character in FILE."
   (let ((hash (make-hash-table)))
      (with-open-file (in file :direction :input)
         (do ((line (read-line in nil) (read-line in nil)))
             ((null line) hash)
           ;; tally every character of the downcased line
           (map nil 
                (lambda (x) (setf (gethash x hash) (1+ (gethash x hash 0))))
                (string-downcase line))))))

CHARACTER-DISTRIBUTION

And the function that calculates the probabilities:

(defun experiment (distribution myfilter)
   "Return (N . PROBS) where N is the total count of the characters
accepted by MYFILTER and PROBS lists the (CHAR . PROBABILITY) pairs
of those characters, sorted in decreasing probability."
   (let ((n 0)
         (probs nil))

      ;; total number of characters that pass the filter
      (maphash 
         (lambda (x y) (if (funcall myfilter x) (incf n y)))
         distribution)

      ;; convert the raw counts into probabilities
      (maphash 
         (lambda (x y) (push (cons x (/ y n 1.0d0)) probs)) 
         distribution)

      ;; keep only the filtered characters, sorted by probability
      (cons 
         n 
         (remove-if-not 
             (lambda (x) (funcall myfilter (car x)))
             (sort probs (lambda (x y) (> (cdr x) (cdr y))))))))

EXPERIMENT

I will need a filter function that drops all characters other than 'a' to 'z' or the space. Then we perform the test on the text I chose:

(defvar text-result 
          (experiment 
             (character-distribution "../data/twain-tom-sawyer.txt")
             (lambda (x) (or (char= x #\Space) (char<= #\a x #\z)))))

TEXT-RESULT

| Character | Probability |
|-----------|-------------|
| (space)   | 0.1915      |
| e         | 0.0974      |
| t         | 0.0788      |
| a         | 0.0641      |
| o         | 0.0635      |
| n         | 0.0551      |
| h         | 0.0532      |
| i         | 0.0513      |
| s         | 0.0485      |
| r         | 0.0417      |
| d         | 0.0406      |
| l         | 0.0331      |
| u         | 0.0245      |
| w         | 0.0219      |
| m         | 0.0195      |
| y         | 0.0184      |
| g         | 0.0181      |
| c         | 0.0179      |
| f         | 0.0165      |
| b         | 0.0135      |
| p         | 0.0128      |
| k         | 0.0083      |
| v         | 0.0064      |
| j         | 0.0018      |
| x         | 0.0008      |
| q         | 0.0005      |
| z         | 0.0004      |

(N = 367103 characters in total.)
According to the data, the space character (which delineates the beginning and the end of each word) has a 19.8% chance of appearing in the text. This means the expected length of a word is \[ \frac{1}{0.198}-1 \approx 4.1 \]
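The same number can be read off the distribution programmatically; a minimal sketch assuming the text-result variable defined above:

(let ((p0 (cdr (assoc #\Space (cdr text-result)))))
   ;; expected number of characters between two consecutive spaces
   (- (/ 1.0d0 p0) 1))

which evaluates to roughly 4.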
Theoretically, the expected number of words of length \(\ell\) is \(N \cdot (1-p_0)^{\ell} \cdot p_0\). So, for \(\ell=4\) we have \[ 367103 \cdot (1-0.198)^4 \cdot 0.198 \approx 30071. \] Considering the fact that there are 456976 different words (meaningful, or meaningless) of length 4, the *theoretical* chance of finding any one of these words in the text “Tom Sawyer” is 6.58%. If I were to match a pattern against Twain’s texts, I would not choose a word of length 4, much less a word delineated by two spaces on either side. I would go for a part of the text appearing between 2 z’s, as the expected number of characters between two z’s is \[ \frac{1}{0.0004}-1 \approx 2499. \] And for \(\ell=2499\) the theoretical expected number of matches is \[ \frac{367103 \cdot (1-0.198)^{2499} \cdot 0.198}{26^{2499}} \approx 0 \] For computational reasons, length 2499 is rather extreme. I would need a letter used as delineation that creates something of manageable size, say about 0.1%: \[ \frac{367103 \cdot (1-0.198)^{\ell} \cdot 0.198}{26^{\ell}} \approx 0.001 \] This means \(\ell\) is approximately 3. However, the number of words of length 3 is not \(26^3 = 17576\) in the English language. As a matter of fact, there are approximately 1300 words of length 3 in the English language according to [a reputable scrabble site](http://www.scrabble.org.au/words/3sdefs.htm). Since the expected number of 3 letter words in the text is 37495, we can safely say that there are a *lot* of repetitions in the text. Think of the word ‘the’ for example.
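The expected-count formula above is easy to package as a helper (the function name is mine, not from the post):

(defun expected-word-count (n p ell)
   "Expected number of words of length ELL in a text of N characters
whose delimiter occurs with probability P, i.e. N * (1-P)^ELL * P."
   (* n (expt (- 1 p) ell) p))

For example, (expected-word-count 367103 0.198 4) gives roughly 30071 and (expected-word-count 367103 0.198 3) gives roughly 37495, the two numbers used above.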
I need a long enough sequence to reduce the chance of an accidental match, but not something unnecessarily long. I want a length \(\ell\) such that the expected number of words of length \(\ell\) \[ 367103 \cdot (1-p)^{\ell} \cdot p \] is manageably small. For instance, if we choose 'w' as our delimiter, \(p \approx 0.0219\), then the expected length of substrings in “Tom Sawyer” that start and end with 'w' is \[ \frac{1}{0.0219} - 1 \approx 44.7 \]
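To shop around for a delimiter with a manageable expected span, one can scan the whole distribution; a small sketch, again assuming the text-result variable from above:

(loop for (char . p) in (cdr text-result)
      unless (char= char #\Space)
        do (format t "~a: expected span ~,1f~%" char (- (/ 1.0d0 p) 1)))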
There are approximately 8041 substrings in “Tom Sawyer” which start and end with a 'w' (this comes from the probabilities). The actual count is:
cat ../data/twain-tom-sawyer.txt | tr 'A-Z\n' 'a-z ' | sed 's/w/\n/g' | wc -l
8056
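The same count can be approximated within Lisp, since each 'w' opens a new w-delimited segment; a quick cross-check reusing character-distribution:

;; the number of w-delimited segments is (up to one) the number of w's
(gethash #\w (character-distribution "../data/twain-tom-sawyer.txt") 0)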
Between Tom Sawyer and Huckleberry Finn, the overlapping substrings that start with 'w' are:
cat ../data/twain-* | tr 'A-Z\n' 'a-z ' | sed 's/w/\n/g' | sort | uniq -c | sort -k 1 -n -r
    2 e’re los
    2 ell?
    2 ell,
    2 eeks and t!
The rest are unique. For example, the following short string
… with their toes pressed against a w…
appears in Tom Sawyer but not in Huckleberry Finn, and is therefore a good string to use to match against Tom Sawyer: if a text contains this marker, we would put the text through further tests, but discard the text if there is no match.
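As a sketch of this workflow (the helper below is hypothetical, not from the post), the rare marker acts as a cheap prefilter in front of an expensive full match:

(defun prefilter (marker text expensive-match)
   "Call EXPENSIVE-MATCH on TEXT only when the rare MARKER substring
occurs in TEXT; otherwise reject TEXT immediately."
   (when (search marker text)
      (funcall expensive-match text)))

Here search does the cheap scan, and any approximate matcher such as BITAP could play the role of expensive-match.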
Discussion
Human languages are hard because the words are not that random. There is a lot of redundancy built in, which makes the type of statistical attacks I tried to describe above difficult. In a few days, I will repeat the same analysis for genome sequence data.
Footnotes

  1. Here is a good place to look at for these.
