In my last post I implemented a word distance algorithm using an English thesaurus. Today, I will do the same for Turkish. Free Turkish thesauri are hard to find, but after some digging around I found one: the Turkish Thesaurus, developed by volunteers and maintained by Anil Ozbek.
Let us look at the structure of the thesaurus file:
aba|2
(noun)|abla|anne
(noun)|barak|çuha|çul|keçe|kebe|şayak|palas
In the file, the “|” symbol separates fields. On a headword line, the first field is the word and the second is the number of distinct senses listed for it. In this example the word is “aba” and it has 2 senses. The next two lines are those senses: the first field gives the part of speech, and the remaining fields are the synonyms, separated by “|”.
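To make the format concrete, here is a minimal sketch that parses the two kinds of lines. The names split-on-bar, parse-entry-line and parse-sense-line are hypothetical helpers introduced only for this illustration; the splitter uses plain Common Lisp so that the sketch runs without any library.

```lisp
;; Hypothetical helper (not part of the original code): split LINE on "|"
;; using only standard Common Lisp.
(defun split-on-bar (line)
  (loop with start = 0
        for pos = (position #\| line :start start)
        collect (subseq line start pos)
        while pos
        do (setf start (1+ pos))))

;; Hypothetical helper: a headword line such as "aba|2" yields the word
;; and the number of senses that follow it.
(defun parse-entry-line (line)
  (destructuring-bind (word count) (split-on-bar line)
    (values word (parse-integer count))))

;; Hypothetical helper: a sense line such as "(noun)|abla|anne" yields the
;; part of speech (without parentheses) and the list of synonyms.
(defun parse-sense-line (line)
  (let ((fields (split-on-bar line)))
    (values (string-trim "()" (first fields)) (rest fields))))

(parse-entry-line "aba|2")            ; => "aba", 2
(parse-sense-line "(noun)|abla|anne") ; => "noun", ("abla" "anne")
```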
Here is the Lisp code to load the thesaurus:
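;; Requires CL-PPCRE to be loaded (it provides the ppcre: package).
(defvar data
  (let ((res (make-hash-table :test 'equal))
        (vertex ""))
    (with-open-file (infil "../data/turkish_thesaurus.txt" :direction :input)
      ;; Skip the first line of the file.
      (read-line infil nil)
      (do ((line (read-line infil nil)
                 (read-line infil nil)))
          ((null line) res)
        ;; Sense lines start with "(", e.g. "(noun)|abla|anne";
        ;; any other line introduces a new headword.
        (if (char= (elt line 0) #\()
            ;; Link the headword and each synonym in both directions.
            (dolist (word (cdr (ppcre:split #\| line)))
              (push word (gethash vertex res))
              (push vertex (gethash word res)))
            (setf vertex (car (ppcre:split #\| line))))))
    ;; Drop repeated neighbours from every adjacency list.
    (maphash
     (lambda (x y)
       (setf (gethash x res) (remove-duplicates y :test 'equal)))
     res)
    res))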
DATA
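The loader above effectively builds an undirected graph: every headword is linked to each of its synonyms, and each synonym is linked back to the headword. The following sketch shows the same idea on the sample entry, without reading a file. The function build-synonym-table, the variable *sample* and the pre-parsed input shape are assumptions made for this illustration; it also uses pushnew instead of a separate duplicate-removal pass, which is a design variation and not what the original code does.

```lisp
;; Hypothetical helper: ENTRIES is a list of (headword sense ...), where
;; each sense is a list of synonyms. Returns a hash table mapping every
;; word to the list of words it is linked to.
(defun build-synonym-table (entries)
  (let ((table (make-hash-table :test 'equal)))
    (loop for (headword . senses) in entries
          do (dolist (sense senses)
               (dolist (word sense)
                 ;; Record the link in both directions; PUSHNEW skips
                 ;; words that are already present.
                 (pushnew word (gethash headword table) :test #'equal)
                 (pushnew headword (gethash word table) :test #'equal))))
    table))

;; The sample entry from the thesaurus file, already split into fields.
(defparameter *sample*
  (build-synonym-table
   '(("aba" ("abla" "anne")
            ("barak" "çuha" "çul" "keçe" "kebe" "şayak" "palas")))))

(gethash "anne" *sample*)          ; => ("aba"), T
(length (gethash "aba" *sample*))  ; => 9
```

Because the links go both ways, looking up a synonym such as "anne" leads back to "aba" even though "anne" is not a headword in the sample.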
I will use the same distance function as before:
(defun dist (x y &optional (n 5))
  (let ((res (gethash x data))
        (i 0)
        (temp nil))
    (loop while (and (not (member y res :test 'string-equal)) (< i n)) do
      (incf i)
      (setf temp (loop for j in res append (gethash j data)))
      (setf res temp))
    i))
DIST
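To see what the returned number means, here is a sketch of the same loop run on a tiny hand-made graph. The names make-toy-graph and toy-dist, and the words "a", "b", "c", "d" and "z", are made up for this illustration; toy-dist differs from dist only in taking the graph as an argument.

```lisp
;; Hypothetical toy graph: a chain a - b - c - d, with links in both
;; directions. The word "z" does not appear in it at all.
(defun make-toy-graph ()
  (let ((g (make-hash-table :test 'equal)))
    (loop for (a b) in '(("a" "b") ("b" "c") ("c" "d"))
          do (push b (gethash a g))
             (push a (gethash b g)))
    g))

;; Same expansion loop as DIST, but with the graph passed explicitly.
(defun toy-dist (graph x y &optional (n 5))
  (let ((res (gethash x graph))
        (i 0))
    ;; Replace the current frontier by the neighbours of its members
    ;; until Y shows up or N expansions have been made.
    (loop while (and (not (member y res :test #'string-equal)) (< i n))
          do (incf i)
             (setf res (loop for j in res append (gethash j graph))))
    i))

(defparameter *toy* (make-toy-graph))

(toy-dist *toy* "a" "b") ; => 0, direct synonyms
(toy-dist *toy* "a" "c") ; => 1
(toy-dist *toy* "a" "d") ; => 2
(toy-dist *toy* "a" "z") ; => 5, never reached within N expansions
```

So direct synonyms are at distance 0, each extra hop adds 1, and the value n (5 by default) means the target was not reached within n expansions.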
and a simple test:
(dist "hastalık" "korku")
1
(dist "baba" "gurur")
3
As before, I will use the 6 basic emotions and a few extra words as my basis categories:
(defvar categories (list "korku" "öfke" "iğrenç" "üzüntü" "mutlu" "şaşkın" "utanç" "gurur"))
CATEGORIES
and now comes the measuring function:
(defun measure (x)
  (cons x (mapcar (lambda (u) (- 5 (dist u x))) categories)))
MEASURE
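The arithmetic inside measure can be spelled out on its own. In this sketch, score is a hypothetical name for the subtraction that measure applies to each distance; the distances 1 and 3 are the ones returned by the two dist tests earlier.

```lisp
;; Hypothetical helper: turn a distance between 0 and N into a closeness
;; score, so that a larger score means a closer word.
(defun score (distance &optional (n 5))
  (- n distance))

(score 0) ; => 5, the two words are direct synonyms
(score 1) ; => 4, e.g. the distance found for "hastalık" and "korku"
(score 3) ; => 2, e.g. the distance found for "baba" and "gurur"
(score 5) ; => 0, the category was not reached within 5 expansions
```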
I will test my function on a (not-so) random selection of words. Unlike in the last post, this time a larger measured value means a stronger association with that emotional dimension, because measure subtracts the distance from 5.
(defvar test (list "adam" "kadın" "çocuk" "aile" "ölüm" "doğum" "düğün" "şiddet" "polis" "devlet" "tecavüz" "dayak" "açlık" "soğuk"))
TEST
         korku   öfke    iğrenç  üzüntü  mutlu   şaşkın  utanç
adam     1       3       0       2       1       0       0
kadın    1       1       1       1       1       1       1
çocuk    2       1       1       1       1       1       1
aile     2       1       0       2       1       0       1
ölüm     4       3       1       5       2       0       3
doğum    2       1       1       3       4       1       3
düğün    2       1       0       2       2       1       0
şiddet   2       1       1       2       1       0       1
polis    3       0       2       2       1       0       2
devlet   3       3       0       2       2       0       2
tecavüz  0       0       0       0       0       0       0
dayak    2       1       0       3       0       0       1
açlık    2       1       0       2       0       0       0
soğuk    4       2       0       4       2       1       2
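A table like this could be printed with something along the following lines. This is only a sketch: print-measurements is a hypothetical helper, and it assumes each row is a list holding a word followed by its scores, which is the shape measure returns.

```lisp
;; Hypothetical helper: print a header made of COLUMNS, then one line per
;; row. Each row is a list (word score score ...).
(defun print-measurements (columns rows &optional (stream t))
  ;; ~10a pads the word to 10 characters, ~{~8a~} pads every remaining
  ;; item to 8 characters.
  (format stream "~10a~{~8a~}~%" "" columns)
  (dolist (row rows)
    (format stream "~10a~{~8a~}~%" (first row) (rest row))))

;; Two rows copied from the table, restricted to the first two columns.
(print-measurements '("korku" "öfke")
                    '(("adam" 1 3) ("ölüm" 4 3)))
```

With the definitions from this post, the call would presumably look like (print-measurements categories (mapcar #'measure test)).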