The Kitchen Sink and Other Oddities

Atabey Kaygun

Using Word2Vec from Clojure

A few days ago I wrote a post on using Word2Vec, specifically the Word2VecJava library, from Common Lisp using Armed Bear Common Lisp (ABCL), which is a Lisp hosted on the JVM. That was one of the options I explored. Today I am going to write about another option: Clojure.

Leiningen vs Boot

For most of the blog posts here, I try to keep the code I write pretty tight: no multiple files, no hierarchy of subdirectories, etc. Leiningen is a powerful scaffolding tool for building projects, but it is not a good fit for this blog. Boot is a better choice.

Let’s get to it

I am going to use Boot as I mentioned above. One of the things I like about it is that you can dynamically define your dependencies within your program body. I know that there is pomegranate, but I find using Boot easier.

(set-env! :dependencies
    '[[com.medallia.word2vec/Word2VecJava "0.10.3"]])

Now, let me import the Java classes Word2VecModel and java.io.File.

(import com.medallia.word2vec.Word2VecModel)
(import java.io.File)

I am going to need clojure.string later on.

(require '[clojure.string :as st])

I’ll define a function that loads a binary model file and returns a searcher object that the functions below can query:

(defn load-model [model-file]
   (->> model-file
        File.
        Word2VecModel/fromBinFile
        .forSearch))
#'boot.user/load-model
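
In case the threading macro looks odd next to the interop forms: each step of ->> becomes a call on the result of the previous step, so a constructor (File.), a static method and an instance method can all be chained. Here is a small sketch of the same pattern that uses StringBuilder from the JDK as a stand-in, so it runs without the Word2VecJava dependency. The name reverse-string is mine, for illustration only.

```clojure
;; Same shape as load-model: constructor, then instance methods.
;; Equivalent to (.toString (.reverse (StringBuilder. s)))
(defn reverse-string [s]
  (->> s
       StringBuilder.
       .reverse
       .toString))

(reverse-string "hello")
;; "olleh"
```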

OK. Now that’s done, I am going to define the functions I am going to test today:

(defn cosine-distance [w1 w2 engine]
   (.cosineDistance engine w1 w2))
#'boot.user/cosine-distance


(defn raw-vector [word engine]
   (.getRawVector engine word))
#'boot.user/raw-vector


(defn get-matches [word num engine]
   (map (fn [x] (vector (.getKey x) (.getValue x)))
        (.getMatches engine word num)))
#'boot.user/get-matches
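
The get-matches function unpacks each Java result object into a Clojure vector through its getKey and getValue methods. As a sketch of that conversion, here is the same unpacking applied to plain java.util map entries. The toy data and the name entry->pair are assumptions of mine, not part of the library. I use mapv here so that the result is realized immediately, whereas map returns a lazy sequence.

```clojure
(import java.util.AbstractMap$SimpleEntry)

;; Turn a Java key/value object into a two-element Clojure vector.
(defn entry->pair [e]
  (vector (.getKey e) (.getValue e)))

;; Made-up entries standing in for the matches returned by a searcher.
(def toy-matches
  [(AbstractMap$SimpleEntry. "home" 1.0)
   (AbstractMap$SimpleEntry. "bungalow" 0.58)])

(mapv entry->pair toy-matches)
;; [["home" 1.0] ["bungalow" 0.58]]
```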


(defn analogy [w1 w2 w3 engine]
   (get-matches w3 3 (.similarity engine w1 w2)))
#'boot.user/analogy
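
Analogy queries on word vectors are usually described as vector arithmetic: for "w1 is to w2 as w3 is to ?", take the offset w2 - w1, add it to w3, and look for the nearest word. Whether Word2VecJava does exactly this internally is an assumption on my part. The sketch below only illustrates the idea on made-up two-dimensional vectors, and every name in it (toy-vectors, toy-analogy) is hypothetical.

```clojure
;; Toy 2-d vectors, invented for illustration only.
(def toy-vectors
  {"king"  [0.9 0.8]
   "man"   [0.9 0.1]
   "woman" [0.1 0.1]
   "queen" [0.1 0.8]
   "apple" [0.5 0.5]})

(defn dot [u v] (reduce + (map * u v)))

(defn cosine [u v]
  (/ (dot u v) (Math/sqrt (* (dot u u) (dot v v)))))

;; "w1 is to w2 as w3 is to ?": target = w2 - w1 + w3, then pick the
;; remaining word whose vector points most nearly in the target direction.
(defn toy-analogy [w1 w2 w3]
  (let [target     (mapv + (mapv - (toy-vectors w2) (toy-vectors w1))
                         (toy-vectors w3))
        candidates (remove #{w1 w2 w3} (keys toy-vectors))]
    (apply max-key #(cosine target (toy-vectors %)) candidates)))

(toy-analogy "man" "king" "woman")
;; "queen"
```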

Now, let me test. First I’ll load up a model:

(def *engine* (load-model "english-news-model.bin"))
#'boot.user/*engine*


(cosine-distance "allocate" "reserve" *engine*)
0.1157773104488681


(get-matches "home" 4 *engine*)
(("home" 0.9999999999999999) ("bungalow" 0.5845105015466799) ("townhouse" 0.5397085894763874) ("residence" 0.5256589338286517))
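
Notice that "home" matched against itself scores essentially 1, so these numbers behave like cosine similarities (higher means closer) rather than distances in the strict sense. For reference, here is a minimal sketch of cosine similarity in plain Clojure. It is my own illustration with hypothetical helper names, not the library's code.

```clojure
;; Dot product of two equal-length numeric vectors.
(defn dot [u v] (reduce + (map * u v)))

;; Euclidean length of a vector.
(defn norm [u] (Math/sqrt (dot u u)))

;; Cosine of the angle between u and v: 1 for the same direction,
;; 0 for orthogonal vectors.
(defn cosine-similarity [u v]
  (/ (dot u v) (* (norm u) (norm v))))

(cosine-similarity [1.0 0.0] [0.0 1.0])
;; 0.0
```

A vector compared with itself gives 1 up to floating point rounding, which is consistent with the 0.9999999999999999 above.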

Training a model within clojure

I couldn’t get this part of the post to work in ABCL, but that is probably because I am not comfortable with ABCL’s Java FFI. Clojure’s Java interop, especially the conversion of data structures in and out of Java, is more straightforward.

We can get the Word2VecJava library to train a model from within Clojure.

(defn train-model [data]
   (.forSearch (.train (Word2VecModel/trainer) data)))

The argument data has to be a list of lists: one list of words per sentence.

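(defn get-data [data-file]
   ;; Lower-case the text, strip punctuation (periods are kept), split it into
   ;; sentences at a period followed by whitespace, then split into words.
   (->>  (st/split (->> data-file
                        slurp
                        (st/lower-case)
                        (remove (set "!@#$%^&*[]_+-=(){};'\\:\",/<>?“”’‘–—"))
                        (apply str))
                   #"\.\s+")
         (map #(st/split % #"\s+"))))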
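
To see the shape of the data this produces, here is a sketch of the same pipeline applied to a string in memory instead of a file. The name tokenize and the shortened punctuation set are assumptions of mine, chosen to keep the example small.

```clojure
(require '[clojure.string :as st])

;; Same steps as get-data, minus the slurp: lower-case, drop punctuation,
;; split into sentences at ". ", then split each sentence into words.
(defn tokenize [text]
  (->> (st/split (->> text
                      st/lower-case
                      (remove (set "!,;:\"'?"))
                      (apply str))
                 #"\.\s+")
       (map #(st/split % #"\s+"))))

;; The final period is consumed because it is followed by a newline.
(tokenize "Tom ran. Huck, too, ran away.\n")
;; (["tom" "ran"] ["huck" "too" "ran" "away"])
```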

I have a collection of Mark Twain’s writings from Project Gutenberg, which I will use as the training data:

(def new-engine (train-model (get-data "twain.txt")))

Let us see what this model thinks about Tom and Finn:

(cosine-distance "tom" "finn" new-engine)
0.9997468543897505