The Kitchen Sink and Other Oddities

Atabey Kaygun

Turkish Sentiment Analysis Using Thesaurus Distance

Description of the problem

In my last post I implemented a word distance algorithm using an English thesaurus. Today I will do the same for Turkish. Free Turkish thesauri are hard to find, but after some digging around I found one: the Turkish Thesaurus, developed by volunteers and maintained by Anil Ozbek.

Implementation

Let us look at the structure of the thesaurus file:

aba|2
(noun)|abla|anne
(noun)|barak|çuha|çul|keçe|kebe|şayak|palas

In the file, the “|” symbol separates fields. On the first line of an entry, the first field is the word and the second is the number of distinct senses listed for it. In this example the word is “aba” and it has 2 senses. The next two lines are those senses: the first field gives the part of speech, and the remaining fields are the synonyms, separated by “|”.
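To make the layout concrete, here is a minimal sketch that parses the sample entry above without any external library. The helpers `split-fields` and `parse-entry` are hypothetical names introduced only for illustration, and the sketch assumes an entry is handed over as a list of lines.

```lisp
;; Sketch only: SPLIT-FIELDS and PARSE-ENTRY are hypothetical helpers,
;; not part of the original post.
(defun split-fields (line)
  "Split LINE on the | character and return the list of fields."
  (loop with start = 0
        for pos = (position #\| line :start start)
        collect (subseq line start pos)
        while pos
        do (setf start (1+ pos))))

(defun parse-entry (lines)
  "Parse one thesaurus entry given as a list of lines.
Returns three values: the word, the declared number of senses,
and one list per sense of the form (part-of-speech synonym ...)."
  (destructuring-bind (word count) (split-fields (first lines))
    (values word
            (parse-integer count)
            (mapcar #'split-fields (rest lines)))))

;; The sample entry shown above.
(parse-entry '("aba|2"
               "(noun)|abla|anne"
               "(noun)|barak|çuha|çul|keçe|kebe|şayak|palas"))
```

The three returned values are the head word, the sense count from the first line, and the sense lines broken into fields, which is all the loader needs in order to connect a word to its synonyms.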

Here is the Lisp code to load the thesaurus:

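;; Requires CL-PPCRE, e.g. (ql:quickload :cl-ppcre) if Quicklisp is available.
(defvar data
   ;; Build an undirected synonym graph: each word maps to the list of
   ;; words that share a sense with it.
   (let ((res (make-hash-table :test 'equal))
         (vertex ""))
      (with-open-file (infil "../data/turkish_thesaurus.txt" :direction :input)
         ;; Skip the first line of the file.
         (read-line infil nil)
         (do ((line (read-line infil nil)
              (read-line infil nil)))
             ((null line) res)
             ;; Sense lines start with "(" as in "(noun)|abla|anne";
             ;; any other line is the head word of a new entry.
             (if (char= (elt line 0) #\()
                 (dolist (word (cdr (ppcre:split #\| line)))
                    (push word (gethash vertex res))
                    (push vertex (gethash word res)))
                 (setf vertex (car (ppcre:split #\| line))))))
      ;; Remove repeated neighbours from every adjacency list.
      (maphash
         (lambda (x y)
             (setf (gethash x res) (remove-duplicates y :test 'equal)))
         res)
      res))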

DATA

I will use the same distance function as before

(defun dist (x y &optional (n 5))
   "Number of times the synonym list of X must be expanded before Y
appears in it, capped at N."
   (let ((res (gethash x data))
         (i 0)
         (temp nil))
      (loop while (and (not (member y res :test 'string-equal)) (< i n)) do
         (incf i)
         (setf temp (loop for j in res append (gethash j data)))
         (setf res temp))
      i))

DIST

and a simple test:

(dist "hastalık" "korku")

1

(dist "baba" "gurur")

3
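To see what these numbers mean, the same expansion loop can be run on a tiny hand-made graph. This is only a sketch: `*toy*`, `link` and `toy-dist` are hypothetical names, and the chain a, b, c, d is invented for illustration and does not come from the thesaurus.

```lisp
;; Sketch only: *TOY*, LINK and TOY-DIST are hypothetical names; the graph
;; is hand-made and is not taken from the Turkish thesaurus.
(defvar *toy* (make-hash-table :test 'equal))

(defun link (x y)
  "Record X and Y as synonyms of each other in *TOY*."
  (pushnew y (gethash x *toy*) :test 'equal)
  (pushnew x (gethash y *toy*) :test 'equal))

;; A simple chain: a - b - c - d
(link "a" "b")
(link "b" "c")
(link "c" "d")

(defun toy-dist (x y &optional (n 5))
  "Same loop as DIST, but reading from *TOY*: count how many times the
neighbour list of X must be expanded before Y shows up, giving up at N."
  (let ((res (gethash x *toy*))
        (i 0))
    (loop while (and (not (member y res :test 'string-equal)) (< i n))
          do (incf i)
             (setf res (loop for j in res append (gethash j *toy*))))
    i))

(toy-dist "a" "b")   ; => 0, "b" is a direct synonym of "a"
(toy-dist "a" "c")   ; => 1, one expansion is needed
(toy-dist "a" "d")   ; => 2
(toy-dist "a" "zzz") ; => 5, the cutoff N, since "zzz" is never reached
```

Two things stand out in the toy run: direct synonyms are at distance 0, and a word that is never reached gets the cutoff value rather than an error.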

Turkish sentiment analysis using thesaurus distance

As before, I will use the 6 basic emotions (fear, anger, disgust, sadness, happiness and surprise) and a few extra words (shame and pride) as my basis categories:

(defvar categories (list "korku" "öfke" "iğrenç" "üzüntü" "mutlu" "şaşkın" "utanç" "gurur"))

CATEGORIES

and now comes the measuring function:

(defun measure (x)
   "Return X followed by one score per category; higher means closer."
   (cons x (mapcar (lambda (u) (- 5 (dist u x))) categories)))

MEASURE
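The arithmetic inside `measure` can be isolated in a one-line sketch. The helper `score` is a hypothetical name, and its `cutoff` argument is assumed to mirror the default `n` of `dist`.

```lisp
;; Sketch only: SCORE is a hypothetical helper that isolates the
;; arithmetic inside MEASURE; CUTOFF mirrors the default N of DIST.
(defun score (distance &optional (cutoff 5))
  "Turn a thesaurus distance into a closeness score."
  (- cutoff distance))

(score 1) ; => 4, the distance reported above for "hastalık" and "korku"
(score 3) ; => 2, the distance reported above for "baba" and "gurur"
(score 5) ; => 0, the word was not reached within the cutoff
```

Subtracting from the cutoff flips the scale, so a score of 5 means a direct synonym and a score of 0 means no connection was found within 5 expansions.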

I will test my function on a (not-so) random selection of words. Unlike in the last post, this time the larger the measured value, the stronger the emotional dimension.

(defvar test (list "adam" "kadın" "çocuk" "aile" "ölüm" "doğum" "düğün" "şiddet" "polis" "devlet" "tecavüz" "dayak" "açlık" "soğuk"))

TEST


         korku     öfke      iğrenç    üzüntü    mutlu     şaşkın    utanç
adam     1         3         0         2         1         0         0
kadın    1         1         1         1         1         1         1
çocuk    2         1         1         1         1         1         1
aile     2         1         0         2         1         0         1
ölüm     4         3         1         5         2         0         3
doğum    2         1         1         3         4         1         3
düğün    2         1         0         2         2         1         0
şiddet   2         1         1         2         1         0         1
polis    3         0         2         2         1         0         2
devlet   3         3         0         2         2         0         2
tecavüz  0         0         0         0         0         0         0
dayak    2         1         0         3         0         0         1
açlık    2         1         0         2         0         0         0
soğuk    4         2         0         4         2         1         2
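One way to read a row of these results is to pick the column with the highest score. The sketch below does that; `*columns*` and `strongest` are hypothetical names, the labels are the seven column headers of the table, and the scores are copied from its rows.

```lisp
;; Sketch only: *COLUMNS* and STRONGEST are hypothetical names. The labels
;; are the seven column headers of the table above.
(defvar *columns*
  (list "korku" "öfke" "iğrenç" "üzüntü" "mutlu" "şaşkın" "utanç"))

(defun strongest (row)
  "ROW is (word score ...). Return the column labels with the highest score,
or NIL when every score is zero."
  (let* ((scores (rest row))
         (best (reduce #'max scores)))
    (when (plusp best)
      (loop for label in *columns*
            for s in scores
            when (= s best) collect label))))

(strongest '("ölüm" 4 3 1 5 2 0 3))   ; => ("üzüntü")
(strongest '("doğum" 2 1 1 3 4 1 3))  ; => ("mutlu")
(strongest '("adam" 1 3 0 2 1 0 0))   ; => ("öfke")
```

Ties return every label that shares the top score, and a row of zeros returns NIL, since a score of 0 only says that no connection was found within the cutoff.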