A few days ago I wrote a post on using Word2Vec, specifically the Word2VecJava library, from Common Lisp using Armed Bear Common Lisp (ABCL), which is a Lisp hosted on the JVM. This was one of the options I explored. Today I am going to write about another option: Clojure.
For most of the blog posts here, I try to keep the code I am writing pretty tight: no multiple files, no hierarchy of subdirectories, etc. Leiningen is a powerful scaffolding tool for building projects, but it is more than this blog needs. Boot is a better fit.
I am going to use Boot as I mentioned above. One of the things I like about it is that you can dynamically define your dependencies within your program body. I know that there is pomegranate, but I find using Boot easier.
(set-env! :dependencies
'[[com.medallia.word2vec/Word2VecJava "0.10.3"]])
Now, let me load up the Java classes Word2VecModel and java.io.File.
(import com.medallia.word2vec.Word2VecModel)
(import java.io.File)
I am going to need clojure.string later on.
(require '[clojure.string :as st])
I’ll define a function that loads a binary model file and returns a searcher built from it:
(defn load-model [model-file]
  (->> model-file
       File.
       Word2VecModel/fromBinFile
       .forSearch))
#'boot.user/load-model
OK. Now that’s done, I am going to define the functions I am going to test today:
(defn cosine-distance [w1 w2 engine]
(.cosineDistance engine w1 w2))
#'boot.user/cosine-distance
(defn raw-vector [word engine]
(.getRawVector engine word))
#'boot.user/raw-vector
(defn get-matches [word num engine]
(map (fn [x] (vector (.getKey x) (.getValue x)))
(.getMatches engine word num)))
#'boot.user/get-matches
(defn analogy [w1 w2 w3 engine]
(get-matches w3 3 (.similarity engine w1 w2)))
#'analogy
Now, let me test. First I’ll load up a model:
(def *engine* (load-model "english-news-model.bin"))
#'boot.user/*engine*
(cosine-distance "allocate" "reserve" *engine*)
0.1157773104488681
(get-matches "home" 4 *engine*)
(("home" 0.9999999999999999) ("bungalow" 0.5845105015466799) ("townhouse" 0.5397085894763874) ("residence" 0.5256589338286517))
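For reference, here is a minimal sketch of cosine similarity in plain Clojure. I am assuming, without having checked the library source, that this is the quantity the scores above are based on; the score of almost exactly 1 for "home" matched against itself is at least consistent with that. The names dot and cosine-similarity are mine and are not part of Word2VecJava.

```clojure
;; Dot product of two equal-length sequences of numbers.
(defn dot [xs ys]
  (reduce + (map * xs ys)))

;; Cosine of the angle between two vectors:
;; 1.0 for the same direction, 0.0 for orthogonal vectors.
(defn cosine-similarity [xs ys]
  (/ (dot xs ys)
     (* (Math/sqrt (dot xs xs))
        (Math/sqrt (dot ys ys)))))

(cosine-similarity [1.0 0.0] [1.0 0.0])
;; => 1.0
(cosine-similarity [1.0 0.0] [0.0 1.0])
;; => 0.0
```

In principle the same function could be applied to two results of raw-vector, although I have not compared its output against the library's own numbers.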
I couldn’t get the next part of this post to work within ABCL, but this is probably because I am not comfortable with ABCL’s Java FFI. Clojure’s Java interop, especially the conversion of data structures in and out of Java, is more straightforward.
We can get the Word2VecJava library to train a model from within Clojure.
(defn train-model [data]
(.forSearch (.train (Word2VecModel/trainer) data)))
The argument data has to be a list of lists: one list of words per sentence.
(defn get-data [data-file]
  (->> (st/split (->> data-file
                      slurp
                      (st/lower-case)
                      ;; strip punctuation, but keep "." for the sentence split below
                      (remove #((set "!@#$%^&*[]_+-=(){};'\\:\",/<>?“”’‘–—") %))
                      (apply str))
                 #"\.\s+")
       ;; split each sentence into words
       (map #(st/split % #"\s+"))))
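To make the expected shape concrete, here is a minimal sketch of the same sentence and word splitting applied to an in-memory string instead of a file. The name text->sentences is a hypothetical helper used only for illustration, and it skips the punctuation stripping that get-data does.

```clojure
(require '[clojure.string :as st])

;; Lower-case the text, split it into sentences on ". ",
;; then split each sentence into words on whitespace.
(defn text->sentences [text]
  (->> (st/split (st/lower-case text) #"\.\s+")
       (map #(st/split % #"\s+"))))

(text->sentences "Tom ran home. Huck followed him.\n")
;; => (["tom" "ran" "home"] ["huck" "followed" "him"])
```

Each inner vector is one sentence, which is the list-of-lists shape that train-model is given.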
I have a collection of Mark Twain’s writings from Project Gutenberg:
(def new-engine (train-model (get-data "twain.txt")))
Let us see what this model thinks about "tom" and "finn":
(cosine-distance "tom" "finn" new-engine)
0.9997468543897505