The other day I was looking at a problem of determining how long a string one needs to use to identify a large text. As we increase the length of the substring, the chances of an accidental match reduces. However, there will be a sweet spot: after a point increasing the length of the substring will not give us an additional information. Today, I will try to figure out this breaking-point.
For todays experiment I will use a sample from genome data GenBank and I will compare the results with “Huckleberry Finn” by Twain. The gene data is a collection of sequences with metadata. In order to get to the genome data, I needed to strip out the metadata and get to the sequence data. Luckily that data is stored within two markers in between “ORIGIN” and “//”. You can whip up a simple perl script (I show my age, aren’t I :) to do that.
I will load the data, and then create the frequency table of n-grams and then compute the entropy of this resulting table. Changes in the entropy will give us the information we need. The following lisp function creates a frequency table for a given length n-gram.
(defun get-ngrams (file len getline)
(let* ((result (make-hash-table :test 'equal)))
(with-open-file (in file :direction :input)
(do ((buffer
(funcall getline in)
(if (< (length buffer) (1+ len))
(concatenate 'string (subseq buffer 1) (funcall getline in))
(subseq buffer 1))))
((< (length buffer) len) result)
(let ((key (subseq buffer 0 len)))
(setf (gethash key result)
(1+ (gethash key result 0))))))))
GET-NGRAMS
And then we calculate the entropy in a modified form. See here on why I do that.
(defun temperature (hash)
(let ((result 0.0d0))
(maphash
(lambda (x y) (incf result (* y 1.0d0 (log y))))
hash)
result))
TEMPERATURE
Then, since I am looking for a phase transition, I need to look at the first derivative of the results:
(defun diff (sequence)
(let (result)
(do* ((x sequence (cdr x)))
((null (cadr x)) (nreverse result))
(push (- (cadr x) (car x)) result))))
DIFF
Now, let us see what is going on:
(defun get-line-genome (stream)
(remove-if-not
(lambda (x) (or (char= x #\a)
(char= x #\t)
(char= x #\g)
(char= x #\c)))
(read-line stream nil)))
GET-LINE-GENOME
(with-open-file (out "ngrams-out1.csv"
:direction :output
:if-exists :supersede
:if-does-not-exist :create)
(let ((temps (diff (loop for i from 1 to 17
collect (temperature
(get-ngrams
"../data/gbbct112-processed.seq"
i
'get-line-genome))))))
(dotimes (i (length temps)) (format out "~d ~12,8f~%" (1+ i) (elt temps i)))))
NIL
From this graph we see that increasing the length of n-grams do not yield much on the interval 1 to 5. There is a phase transition at n=6. Any increase result in drastic changes on the interval 6 to 10. Then there is another phase transition at n=11. This suggests that I should pay attention n-grams of length 6 and 11.
Now, I will repeat the test with Huckleberry Finn. For that I will need to modify the function which reads the lines from the input file.
(defun get-line-text (stream)
(remove-if-not
(lambda (x) (or (char= x #\Space)
(and (char>= x #\a)
(char<= x #\z))))
(read-line stream nil)))
GET-LINE-TEXT
(with-open-file (out "ngrams-out2.csv"
:direction :output
:if-exists :supersede
:if-does-not-exist :create)
(let ((temps (diff (loop for i from 1 to 40
collect (temperature
(get-ngrams
"../data/twain-huckleberry-finn.txt"
i
'get-line-text))))))
(dotimes (i (length temps)) (format out "~d ~12,8f~%" (1+ i) (elt temps i)))))
NIL
There is no phase transition in this example. The longer the n-grams the better the result. And since the entropy becomes the maximum at n=38 I must choose n=38 or higher.