The Kitchen Sink and Other Oddities

Atabey Kaygun

Text Summarization and Topic Analysis

Description of the problem

This is my second post on the subject, but you can read it independently of my first post.

In the most abstract terms: given a text, or a corpus, I’d like to create a discrete Markov chain model that best explains the text, and then make a statistically quantifiable analysis of it. Specifically, I would like to summarize the text and also get a list of topics for the text(s) at hand.

Similarity measure

I will view individual sentences within each text as a bag of words. To be precise, I will consider each sentence as a list of words and their frequencies. For example, I will represent the sentence

Colorless green ideas sleep furiously on the green lawn of my little green house with my sleepless furries.

as

((furries. . 1) (sleepless . 1) (with . 1) (house . 1) (little . 1)
 (my . 2) (of . 1) (lawn . 1) (the . 1) (on . 1) (furiously . 1)
 (sleep . 1) (ideas . 1) (green . 3) (colorless . 1))

I will consider neither the syntax, nor the grammar, nor the parts of speech. It is a very simplistic model, but let us see how far we can go :) I will do one non-trivial thing, though: I will stem the words before I bag them.

Before I give you the code, I am going to need two utilities, the merge-with function and the ->> macro, which I got used to from Clojure. I explained why and how I use them in two earlier posts here and here.
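Since those posts are not reproduced here, below is a minimal sketch of how these two utilities might behave, modeled after their Clojure counterparts; the originals may differ in detail.

(defmacro ->> (x &rest forms)
   "Thread X as the last argument through FORMS, as in Clojure."
   (reduce (lambda (acc form)
              (if (listp form)
                  (append form (list acc))
                  (list form acc)))
           forms
           :initial-value x))

(defun merge-with (fn pairs)
   "Merge an alist of (key . value) PAIRS, combining the values of
equal keys with FN."
   (let (result)
      (dolist (pair pairs result)
         (let ((hit (assoc (car pair) result :test #'equal)))
            (if hit
                (setf (cdr hit) (funcall fn (cdr hit) (cdr pair)))
                (push (cons (car pair) (cdr pair)) result))))))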

Now, the function that returns bag of words for a given sentence:

(defun bag-of-words (sentence)
   ;; assumes cl-ppcre's split and a word stemmer called stem
   (let ((junk (coerce "@#$%^&*[]_+-=(){}':\",/<>“”’‘–—" 'list)))
      (->> (remove-if (lambda (x) (member x junk)) sentence)
           (split "\\s+")
           (mapcar (lambda (x) (cons (stem (string-downcase x)) 1)))
           (merge-with #'+))))
BAG-OF-WORDS
(defvar example
        (bag-of-words "Colorless green ideas sleep furiously on the green
                       lawn of my little green house with my sleepless furries."))
EXAMPLE
example
((furries. . 1) (sleepless . 1) (with . 1) (hous . 1) (littl . 1) (my . 2)
 (of . 1) (lawn . 1) (the . 1) (on . 1) (furious . 1) (sleep . 1) (idea . 1)
 (green . 3) (colorless . 1))

As the dot product of two vectors tells us how similar they are, I’d like to define a dot product between sentences that tells me how similar two sentences are. In this case, I’d like to measure the number of common words between two sentences, with weights.

(defun dot (bag1 bag2)
   ;; a word of bag2 missing from bag1 still contributes a default
   ;; weight of 1: a smoothing choice
   (loop for x in bag2 sum
         (* (cdr x)
            (or (cdr (assoc (car x) bag1 :test #'string=)) 1))))
DOT
(dot example (bag-of-words "How green was my little sleepy valley"))
10

We can turn this on its head: we can measure how similar two words are by how many times they co-occur within the same sentence of a document. But I am not going to define a new function for that. Instead, I will define a new representation of a document from which we can easily infer both of these measures, and more.

Another representation of a document

Now, consider a document which contains \(n\) sentences and \(m\) unique (stemmed) words, and consider an \(n\times m\) matrix \(A\). Label the rows with the sentences, ordered as they appear within the document, and label the columns with the words, ordered lexicographically. We let \(a_{ij}\) be the number of occurrences of the word \(w_j\) within the sentence \(s_i\).

The best thing about this matrix is that \(A A^t\) is the \(n\times n\) matrix measuring the number of common words between each pair of sentences. On the other hand, \(A^t A\) is the \(m\times m\) matrix whose entries, indexed by pairs of words, give the number of sentences in which each pair of words co-occurs.
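For concreteness, take a toy document with two sentences, “green ideas” and “green lawn”, and columns labeled by the words “green”, “ideas”, “lawn”:

\[
A = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}, \qquad
A A^t = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
A^t A = \begin{pmatrix} 2 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}
\]

The off-diagonal \(1\) in \(A A^t\) is the single word (“green”) the two sentences share, while the \((\text{ideas},\text{lawn})\) entry of \(A^t A\) is \(0\) because those two words never co-occur in a sentence.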

The singular value decomposition of \(A\) analyzes both of these matrices at once: the left singular vectors of \(A\) are the eigenvectors of \(A A^t\), the right singular vectors are the eigenvectors of \(A^t A\), and the squares of the singular values are their common nonzero eigenvalues.
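In symbols, if we write the decomposition as \(A = U \Sigma V^t\), then

\[
A A^t = U (\Sigma \Sigma^t) U^t \qquad\text{and}\qquad A^t A = V (\Sigma^t \Sigma) V^t.
\]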

Now, let us write the function constructing the matrix we described above: first, a utility function that slurps a text.

(defun slurp (file)
   (with-open-file (in file :direction :input)
       (do* ((line (read-line in nil) (read-line in nil))
             (res line (concatenate 'string res " " line)))
            ((null line) res))))
SLURP

Next, given a chunk of text, we are going to split it along end-of-sentence markers.

(defun get-sentences (blob)
   ;; pair each sentence with its bag of words
   (->> (split "[.?!;]\\s+" blob)
        (mapcar (lambda (x) (cons x (bag-of-words x))))))
GET-SENTENCES

Let me test:

(get-sentences "Hey! This is a test. Did it work?  Yes! ")
((Hey (hei . 1)) (This is a test (test . 1) (a . 1) (is . 1) (thi . 1))
 (Did it work (work . 1) (it . 1) (did . 1)) (Yes (ye . 1)))

Getting the list of words now is easy: we need to append the individual bags of words, and then apply (merge-with #'+).

(let ((test (get-sentences "Hear, hear! Or don't. But if you do, make sure of being here and hear. ")))
   (merge-with #'+ (reduce #'append test :key #'cdr)))
((but . 1) (if . 1) (you . 1) (do . 1) (make . 1) (sure . 1) (of . 1)
 (be . 1) (here . 1) (and . 1) (or . 1) (dont . 1) (hear . 3))

Now, let us process a given text file. The computationally intensive part is the singular value decomposition:

(defun process (file stops)
   (let* ((sentences (->> file slurp get-sentences))
          (stop-words (->> stops slurp (split "\\s+") (mapcar #'stem)))
          (words (->> (reduce #'append sentences :key #'cdr)
                      (merge-with #'+)
                      (remove-if (lambda (x) (member (car x) stop-words :test #'string=)))))
          (n (length sentences))
          (m (length words))
          (matrix (make-array (list n m))))
      ;; the entry at (i,j) weighs the word w_j within the sentence s_i
      (dotimes (i n)
         (dotimes (j m)
            (setf (aref matrix i j)
                  (dot (cdr (elt sentences i))
                       (list (elt words j))))))
      (list (mapcar #'car sentences)
            (mapcar #'car words)
            (svd matrix))))
PROCESS

I added a small bit that removes the stop words; the rest is shuffling terms around. The SVD returns three items:

  1. Eigenvectors for \(AA^t\)
  2. Eigenvalues for both \(A^tA\) and \(AA^t\)
  3. Eigenvectors for \(A^tA\)
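The svd function itself is not defined in this post; judging by the accessors used below, it is assumed to return an object with an interface like this minimal sketch (the slot q and its name are my assumption, since only u and vt are used in what follows):

;; a sketch of the interface `svd' is assumed to provide
(defstruct svd
  u    ; eigenvectors of A A^t, one per column
  q    ; the eigenvalues, ordered from largest to smallest
  vt)  ; eigenvectors of A^t A, one per row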

The eigenvalues are ordered from largest to smallest, and the eigenvectors are stored in the matching order. I am going to need the vectors corresponding to the largest eigenvalue. Here is the report generator that works on the result of the SVD.

(defun normalize (xs)
   ;; rescale xs so that its mean is 1; the thresholds in report are
   ;; therefore relative to the average weight
   (let ((n (length xs))
         (sum (reduce #'+ xs)))
      (map 'list (lambda (x) (* x (/ n sum))) xs)))
NORMALIZE
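For instance, the list (1 2 3) has mean 2, so every entry gets scaled by 1/2:

(normalize '(1 2 3))
(1/2 1 3/2)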
(defun report (sentences words svd-result t1 t2)
   (let* ((sbag (mapcar #'cons sentences (normalize (grid:column (svd-u svd-result) 0))))
          (tbag (mapcar #'cons words (normalize (grid:row (svd-vt svd-result) 0))))
          summary topics)
      ;; keep the sentences whose weight exceeds t1, in document order
      (dolist (x sbag)
         (if (> (cdr x) t1) (push (car x) summary)))
      ;; keep the words whose weight exceeds t2
      (dolist (x (sort tbag #'> :key #'cdr))
         (if (> (cdr x) t2) (push (car x) topics)))
      (cons (reverse summary) topics)))
REPORT

From the list of sentences and the sentence part of the SVD, we pick the sentences whose weight is larger than the threshold t1; we do the same for the words with the threshold t2. The main difference is that the summary must keep the sentences in the order they appear in the original document, while the topic words, being pushed from a descending sort, come out ordered from the lowest weight to the highest, so the most significant topics appear at the end of the list.

Now, let us put all of this together:
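The original driver call is not shown, but a minimal sketch of how process and report fit together would look like this; the file names and the two threshold values are placeholders, not the ones used in the experiments below.

(destructuring-bind (sentences words svd-result)
    (process "news-story.txt" "stop-words.txt")
  (report sentences words svd-result 1.5 1.5))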

Here is the summary I got from a news story:

The Obama administration has backed down in its bitter dispute with Silicon Valley over the encryption of data on iPhones and other digital devices, concluding that it is not possible to give American law enforcement and intelligence agencies access to that information without also creating an opening that China, Russia, cybercriminals and terrorists could exploit. With its decision, which angered the FBI and other law enforcement agencies, the administration essentially agreed with Apple, Google, Microsoft and a group of the nation’s top cryptographers and computer scientists that millions of Americans would be vulnerable to hacking if technology firms and smartphone manufacturers were required to provide the government with “back doors,” or access to their source code and encryption keys. Companies like Apple say they are protecting their customers’ information by resisting government demands for access to text messages. Security experts like Richard A. Clarke, the former White House counterterrorism czar, also signed the letter to Obama. Mr Comey had expressed alarm a year ago after Apple introduced an operating system that encrypted virtually everything contained in an iPhone. His concern about what the FBI calls the “going dark” problem received support from the director of the National Security Agency and other intelligence officials.

and the list of words I got as “topics”:

effort put comei commun neumann system year kei comput fbi open devic execut chines door googl inform intellig digit data back hous white agenc enforc nation offici secur technolog compani mr law administr obama access govern appl encrypt

Now, here is the summary of another document, an opinion from the US Supreme Court:

The Clean Air Act directs the Environmental Protection Agency to regulate emissions of hazardous air pollutants from power plants if the Agency finds regulation “appropriate and necessary.” We must decide whether it was reasonable for EPA to refuse to consider cost when making this finding. regulation is appropriate and necessary after considering the results of the study,” it “shall regulate [power plants] under [7412].” Ibid EPA has interpreted the Act to mean that power plants become subject to regulation on the same terms as ordinary major and area sources, see 77 Federal Regulations 9330 (2012), and we assume without deciding that it was correct to do so. EPA’s disregard of cost rested on its interpretation of 7412(n)(1)(A), which, to repeat, directs the Agency to regulate power plants if it “finds such regulation is appropriate and necessary.” The Agency accepts that it could have interpreted this provision to mean that cost is relevant to the decision to add power plants to the program. But when adopting the regulations before us, the Agency insisted that the provisions concerning all three studies “provide a framework for [EPA’s] determination of whether to regulate [power plants].” 76 Federal Regulations 24987. Turning to the mechanics of the hazardous-air-pollutants program, EPA argues that it need not consider cost when first deciding whether to regulate power plants because it can consider cost later when deciding how much to regulate them. hence 7412(n)(1)(A)‘s requirement to study hazards posed by power plants’ emissions “after imposition of the requirements of [the rest of the Act].” But if uncertainty about the need for regulation were the only reason to treat power plants differently, Congress would have required the Agency to decide only whether regulation remains “necessary,” not whether regulation is “appropriate and necessary.” In any event, EPA stated when it adopted the rule that “Congress did not limit [the] appropriate and necessary inquiry to [the study mentioned in 7412(n)(1)(A)].” 77 Federal Regulations 9325. Because power plants are regulated under other federal and state laws, the best-performing power plants’ emissions limitations might reflect cost-blind regulation rather than cost-conscious decisions.

and its topics:

doe part necessary. 7412n1a regulatori interpret relev standard benefit feder necessari health congress studi decis program environment make requir clean sourc direct hazard find pollut act reason air whether decid emiss consid appropri epa power plant agenc cost regul