The Kitchen Sink and Other Oddities

Atabey Kaygun

Phase transitions in entropy

Description of the problem

In my last post I looked at the way the information content (the entropy) changes as I increased the lengths of the n-grams. The text I consider was a randomly chosen genome sequence and Twain’s Huckleberry Finn and Tom Sawyer.

One curious observation I made in that post was the fact that there was a dramatic phase shift. Today, I will test if that shift is specific to the genome data or it is an artifact of the method.

An experiment

I will use the following functions in my experiment. The first function is for creating a frequency table for n-grams for a specific n. The second function is for calculating the entropy, and the last for calculating the derivative of the entropy function. As you might recall, the derivative will tell us where the phase transition happens.

(defun get-ngrams (file len getline)
   (let* ((result (make-hash-table :test 'equal)))
      (with-open-file (in file :direction :input)
         (do ((buffer 
                 (funcall getline in)
                 (if (< (length buffer) (1+ len))
                    (concatenate 'string (subseq buffer 1) (funcall getline in))
                    (subseq buffer 1))))
               ((< (length buffer) len) result)
             (let ((key (subseq buffer 0 len)))
               (setf (gethash key result) 
                     (1+ (gethash key result 0))))))))

GET-NGRAMS

(defun temperature (hash)
  (let ((result 0.0d0)
        (n 0))
    (maphash 
       (lambda (x y) 
          (incf result (* y (log y 2.0d0)))
          (incf n y))
       hash)
    (- (log n 2.0d0) (/ result n))))

TEMPERATURE

(defun diff (sequence)
   (let (result)
      (do ((x sequence (cdr x)))
          ((null (cadr x)) (nreverse result))
        (push (- (cadr x) (car x)) result))))

DIFF

For the first experiment, I will re-consider the genome data again.

(defun get-line-genome (stream)
   (remove-if-not
       (lambda (x) (or (char= x #\a) (char= x #\g) (char= x #\t) (char= x #\c)))
       (read-line stream nil)))

GET-LINE-GENOME

(defparameter *result*
   (diff (loop for i from 1 to 18 
           collect (temperature (get-ngrams "../data/gbpat91-processed.seq" i 'get-line-genome))))) 

*RESULT*

Now, I will perform the same analysis for the first \(10^7\) decimal digits of \(\pi\).

(defun get-line-pi (stream)
   (remove-if-not
       (lambda (x) (and (char>= x #\0) (char<= x #\9)))
       (read-line stream nil)))

GET-LINE-PI

(defparameter *result* 
   (diff (loop for i from 1 to 9
           collect (temperature (get-ngrams "../data/digits-of-pi-1e7" i 'get-line-pi)))))

*RESULT*

The natural question is, are these results which show similar phase transitions, particular to these datasets or are they artifacts of the method. So, I will generate \(10^6\) uniformly random decimal digits and repeat the analysis on that data.

(defun random-line (n m)
   (reduce 
      (lambda (x y) (concatenate 'string x y))
      (loop repeat n collect (format nil "~x" (random m)))))

RANDOM-LINE

(with-open-file (out "temp" 
                     :direction :output
                     :if-exists :supersede
                     :if-does-not-exist :create)
   (loop repeat 1000 do (format out "~s~%" (random-line 100 10))))

NIL

(defparameter *result* 
  (diff (loop for i from 1 to 9
           collect (temperature (get-ngrams "temp" i (lambda (x) (read-line x nil)))))))

*RESULT*

OK. The phase transition happens here as well but the place it happens is shifted. What happens if I change these random numbers from decimal to quaternary, as in the genome data?

(with-open-file (out "temp"
                     :direction :output
                     :if-exists :supersede
                     :if-does-not-exist :create)
   (loop repeat 1024 do (format out "~s~%" (random-line 128 4))))

NIL

(defparameter *result*
  (diff (loop for i from 1 to 12 
           collect (temperature (get-ngrams "temp" i (lambda (x) (read-line x nil)))))))

*RESULT*

Analysis

Hmmm. I don’t know what to say. The transition is common to all of these examples but the width of the transition period, and where they start all differ. I need to understand this on a more conceptual footing. I need to think a bit more…