The Kitchen Sink and Other Oddities

Atabey Kaygun

Sentiment analysis using word distances

Description of the problem

A thesaurus is a dictionary which gives a list of roughly equivalent words for each entry. A thesaurus path is a sequence of words

\(\mathrm{word}_1, \mathrm{word}_2, \mathrm{word}_3, \ldots, \mathrm{word}_n\)

such that any two consecutive words appear together in the synonym list of some third word, or one appears in the synonym list of the other, in a given thesaurus. I will define the thesaurus distance of a pair of words as the length of the shortest thesaurus path connecting one word to the other.
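
In symbols (reading “length” as the number of steps along the path, which is what the code below computes), the thesaurus distance between two words \(u\) and \(v\) is

\[ d(u,v) = \min\{\, n \ge 0 \mid u = w_0, w_1, \ldots, w_n = v \text{ is a thesaurus path} \,\}, \]

with \(d(u,u) = 0\) by convention.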

Today, I will implement the thesaurus distance.

Implementation

Any implementation of the thesaurus distance depends on the thesaurus one uses. I am going to use the English thesaurus of the Moby Project. Each line of the data is a comma-separated collection of synonymous words.

The data structure I am going to use is a hash-map indexed by individual words, storing the list of synonyms for each word. Essentially, I am creating an undirected graph whose vertices are labeled by words, and two words are connected by an edge if and only if they are synonymous.

(require :cl-ppcre)

NIL

(defvar data 
   (let ((res (make-hash-table :test 'equal)))
      (with-open-file (infil "../data/moby-thesarus.csv" :direction :input)
         ;; each line of the file is a comma-separated list of synonyms
         ;; whose first word serves as the head word for that line
         (do ((line (read-line infil nil) (read-line infil nil)))
             ((null line))
           (let* ((words (ppcre:split "," line))
                  (vertex (car words)))
              (dolist (word (cdr words))
                 (push word (gethash vertex res))
                 (push vertex (gethash word res)))))
         ;; remove duplicate neighbors so that each edge is recorded once
         (maphash 
            (lambda (x y)
               (setf (gethash x res) (remove-duplicates y :test 'equal)))
            res)
      res)))

DATA
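
As a quick sanity check (this snippet is mine, not from the original post), one can look up the synonym list of any word in the graph directly. I will not reproduce the output here since it depends on the exact version of the thesaurus file:

(gethash "cold" data)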

By the way, I am using a lot of memory to load the data up, so you might need to increase the stack size limit in your lisp implementation. I also tried this in Scala, but it didn't work because of insufficient memory, and I was too lazy to dig in and figure out how to fix it. Besides, the lisp version works just fine :)

Now comes my implementation of the distance function.

(defun dist (x y)
   ;; repeatedly replace the current set of words with the set of all their
   ;; synonyms until y shows up, giving up after 5 rounds; the number of
   ;; rounds taken is returned as the distance (capped at 5)
   (let ((res (list x))
         (i 0)
         (temp nil))
      (loop while (and (not (member y res :test 'string-equal)) (< i 5)) do
         (incf i)
         (setf temp (loop for j in res append (gethash j data)))
         (setf res temp))
      i))

DIST

Let me test it:

(dist "terrified" "cold")

2
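
The conventions in dist are worth spelling out (this example is mine, not from the original post, but it follows directly from the code): a word is at distance 0 from itself, and any pair not connected within 5 rounds is reported as 5.

(dist "cold" "cold")

0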

Sentiment analysis

Here is an experiment. There are six basic emotions: anger, fear, disgust, sadness, surprise, and happiness. Assuming the thesaurus distance of a word \(w\) to, say, “fear” indicates the emotional content of \(w\) in the fear dimension, can we analyze a text using this metric? Below, I will also throw in a few extra words along with these six basic emotions.

(defvar categories (list "anger" "fear" "disgust" "sad" 
                         "guilt" "embarrassment" "shame" "pride" 
                         "surprise" "happy"))

CATEGORIES

(defvar test (list "death" "sickness" "doctor" "ugly" "beauty" 
                   "mother" "father" "family" "marriage" "child" 
                   "hunger" "snake" "cat" "cold" "sea" "debt" "kale"
                   "government" "police" "violence" "rape" "murder"))

TEST

(defun measure (x)
   ;; thesaurus distance from x to each of the category words,
   ;; with x prepended as a row label
   (cons x (mapcar (lambda (u) (dist u x)) categories)))

MEASURE
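
The original post does not show how the table below was printed; presumably it came from mapping measure over the test words, along these lines (the printing code here is my guess, not from the original):

(dolist (row (mapcar #'measure test))
   (format t "~{~a ~}~%" row))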

The results are shown below. The smaller the number, the more closely the word is associated with that dimension.

word anger fear disgust sad guilt embarrassment shame pride surprise
death 2 2 3 2 2 2 2 3 3
sickness 3 2 1 3 3 2 3 3 3
doctor 2 3 3 3 3 2 2 2 3
ugly 3 3 3 2 3 2 2 3 3
beauty 3 2 3 3 3 3 2 3 2
mother 2 3 3 3 4 3 3 3 3
father 2 3 3 3 3 3 3 2 3
family 2 2 3 3 3 2 2 2 3
marriage 2 3 3 3 3 2 3 3 3
child 3 3 3 3 3 3 3 3 3
hunger 2 3 3 3 3 3 2 2 2
snake 2 3 2 3 3 2 3 2 2
cat 2 3 3 3 3 2 3 3 2
cold 3 3 3 2 3 3 2 2 3
sea 2 3 3 2 3 2 3 3 3
debt 3 3 3 2 2 2 2 3 3
kale 3 3 3 3 4 3 3 3 3
government 3 2 3 3 3 2 3 2 3
police 2 3 3 3 3 3 3 3 3
violence 2 2 2 2 3 2 2 3 2
rape 2 3 2 3 2 2 1 3 2
murder 3 3 3 3 3 2 2 2 2

Analysis

The method may not be very practical on large bodies of text, as the memory requirements are large and the distance algorithm is not very efficient.
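
For what it is worth, the repeated re-expansion in dist could be avoided by keeping a table of already visited words. Below is a rough sketch of such a variant, my own addition rather than part of the original post, assuming the same data hash table and the same cap of 5 rounds:

(defun dist-bfs (x y &optional (cap 5))
   ;; breadth-first search over the synonym graph with a visited table,
   ;; so that no word is expanded more than once
   (if (string-equal x y)
       0
       (let ((seen (make-hash-table :test 'equal))
             (frontier (list x)))
          (setf (gethash x seen) t)
          (loop for i from 1 to cap do
             (let ((next nil))
                (dolist (w frontier)
                   (dolist (s (gethash w data))
                      (when (string-equal s y)
                         (return-from dist-bfs i))
                      (unless (gethash s seen)
                         (setf (gethash s seen) t)
                         (push s next))))
                (setf frontier next)))
          ;; give up after cap rounds, like the original dist does
          cap)))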