I have been looking into using word2vec representations of words in some applications. The original implementation, which is in C, is heavily optimized and fast, but not written as a library. There are two java implementations, but since I didn't want to use java, I explored several options; the 3rd was using an external java Word2Vec library from ABCL.
Today, I'd like to write about that 3rd option.
I found two java implementations that I can use as a library for word2vec: DeepLearning4j and Medallia's Word2VecJava.
DeepLearning4j has an extensive framework for many data analysis tasks, but I have a distaste for large frameworks. Yes, I want the banana but not the gorilla holding it, nor the jungle behind it.
As for lisp implementations on the JVM, ABCL seems to be the most sensible choice. I have been experimenting with ABCL for a while, and this seems to be a good project to put what I have learned into use.
All of the code below runs on ABCL.
First, let us load the external dependencies from maven. For that I am going to need the abcl-asdf library:
(require :abcl-asdf)

(java:add-to-classpath
 (abcl-asdf:as-classpath
  (abcl-asdf:resolve "com.medallia.word2vec:Word2VecJava")))
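The snippets below also lean on the jss contrib, for the #"method" reader syntax, jss:find-java-class and friends. If it is not already loaded, pull it in the same way:

(require :jss)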
Now, we need to load a pre-trained model. You can download one from google. Warning: the model I link here is really large. The function below loads such a model up:
(defun load-model (model-file)
  ;; fromBinFile is a static method, so the class object itself is
  ;; passed in place of an instance.
  (java:jcall (java:jmethod (jss:find-java-class "Word2VecModel")
                            "fromBinFile"
                            (jss:find-java-class "java.io.File"))
              (jss:find-java-class "Word2VecModel")
              (jss:new "java.io.File" model-file)))
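As an aside, ABCL's java:jstatic makes static calls a bit shorter. Assuming the fully qualified class name is com.medallia.word2vec.Word2VecModel (my guess from the maven coordinates above), an equivalent sketch would be:

(defun load-model* (model-file)
  ;; java:jstatic calls a static method by name; the class name here
  ;; is inferred from the maven coordinates, so double-check it.
  (java:jstatic "fromBinFile" "com.medallia.word2vec.Word2VecModel"
                (java:jnew "java.io.File" model-file)))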
Instead of the model I linked above, I am going to use a model I trained on Charles Dickens novels that I got from Project Gutenberg.
(defvar *model* (load-model "dickens-model.bin"))
Loading the model is not enough. Before we can work with it, we must create the search engine that will do the heavy lifting for us:
(defvar *engine* (#"forSearch" *model*))
Now that we have loaded our model and built our engine, we can do several things. We can get the raw numerical vector for a given word:
(defun get-raw-vector (word)
  (jss:jlist-to-list (#"getRawVector" *engine* word)))
(get-raw-vector "home")
(-0.045455124953779076d0 -0.07530373816504114d0 -0.1373770205861963d0 -0.09495558319523743d0 -0.19750882287429283d0 -0.04369557384124098d0 0.056864173803157614d0 0.22692607775773763d0 -0.04918216427536493d0 -0.06573755212904461d0 0.010485399806856952d0 -0.0125644020514777d0 -0.03867824662489532d0 -0.12456286407334217d0 -0.012954963105658404d0 0.030908279730781325d0 -0.11467134544176405d0 0.06539173695163725d0 0.13103033598097535d0 7.144583495236003d-4 -0.07119027128394684d0 0.07924226790092176d0 -0.28841552796482617d0 0.19415477991101823d0 -0.08192961323919903d0 -0.11783073799680197d0 0.02541784057745355d0 -0.07604135763281597d0 0.053157081515501685d0 -0.01957647728980732d0 -0.09085722216510987d0 -0.14392533472382035d0 0.03668373024862821d0 0.057883023922329746d0 0.015932456392836705d0 0.02183309123133587d0 -0.01694965098894987d0 0.12502777065149165d0 0.18801864213570413d0 0.024514509414527853d0 0.13138849001188765d0 -0.052477432982261325d0 0.11191356280404362d0 0.12131386500449419d0 -0.1580452165120508d0 -0.03601234870473234d0 -0.11794962962967813d0 0.0696481313652186d0 -0.09822390449213687d0 -0.01863944486286492d0 -0.045831517365810745d0 -0.0326415507430022d0 0.015811289743998244d0 0.03560473087510719d0 -0.06695631037666895d0 -0.030046096497852563d0 0.02240055310634835d0 0.10337602826378908d0 -0.09319892659156494d0 0.09822674374606084d0 0.09359655815308475d0 -0.13900886052709738d0 -0.10425136004763672d0 -0.01143917766889959d0 -0.10838075735344216d0 0.10956555931424225d0 0.006879066499163433d0 0.07245139307028406d0 -0.11482412110500234d0 0.10310742124136513d0 -0.07018913249875851d0 -0.14215548494023766d0 -0.18817284592666467d0 -0.06328703372099141d0 0.17183195350602862d0 0.15445845673613365d0 -0.09132729720549218d0 0.09245411548061742d0 0.049801117380410244d0 0.03749830709895903d0 -0.0025661189184682997d0 0.09430764674811494d0 0.03997497384554557d0 0.0741059470425769d0 -0.04097753225769587d0 0.024709648616479025d0 0.14781939698282584d0 -0.0795485757950607d0 0.029375951814573274d0 -0.2373760623347922d0 -0.0582029925381927d0 -0.035078423305661834d0 -0.00451008525825166d0 -0.11079127543413236d0 0.21376441866656629d0 -0.035605651082404405d0 0.04872234965260608d0 -0.05782217548045017d0 -0.09422334470870174d0 -0.07470838742083385d0)
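The length of this list is the dimensionality of the embedding space the model was trained with:

(length (get-raw-vector "home"))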
We can calculate cosine distance between pairs of words:
(defun cosine-distance (w1 w2)
  (#"cosineDistance" *engine* w1 w2))
(cosine-distance "home" "sea")
0.23492995551523996d0
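As a sanity check, and assuming cosineDistance is (despite the name) the usual cosine similarity, i.e. the dot product of the two raw vectors divided by the product of their norms, we can reproduce the number in plain Lisp. The helpers below are my own, not part of the library:

(defun dot (v1 v2)
  ;; Dot product of two lists of numbers.
  (reduce #'+ (mapcar #'* v1 v2)))

(defun vector-cosine (v1 v2)
  ;; dot(v1, v2) / (|v1| * |v2|)
  (/ (dot v1 v2)
     (* (sqrt (dot v1 v1)) (sqrt (dot v2 v2)))))

(vector-cosine (get-raw-vector "home") (get-raw-vector "sea"))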
We can get a list of similar words:
(defun get-matches (word num &optional searcher)
  (mapcar (lambda (x) (cons (#"getKey" x) (#"getValue" x)))
          (jss:jlist-to-list
           (#"getMatches" (or searcher *engine*) word num))))
(get-matches "terrible" 5)
(("terrible" . 0.9999999999999996d0) ("horrible" . 0.7570643210268546d0) ("dreadful" . 0.6858479525397655d0) ("frightful" . 0.6734959984485096d0) ("fearful" . 0.6431009514957757d0))
We can even do analogies:
(defun analogy (w1 w2 w3)
  ;; The semantic difference between w1 and w2 can be searched like
  ;; the engine itself, ranking completions of the analogy w1:w2 :: w3:?
  (let ((semantic-difference (#"similarity" *engine* w1 w2)))
    (car (get-matches w3 1 semantic-difference))))
(analogy "king" "queen" "man")
("woman" . 0.9041833779267661d0)
Can we also use this library to train a model? We can, but I would advise against it. The original implementation is optimized to be very fast and handles large corpora really well. However, the original implementation does not clean the training data of punctuation marks, stop words, etc., nor does it down-case words. One must pre-process the corpus before it is fed to that particular implementation. For that purpose, you can use any language you'd like.
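For instance, a minimal pre-processing pass in Lisp might look like the sketch below; a real pipeline would also want to deal with stop words, numbers and so on:

(defun clean-corpus (text)
  ;; Down-case the text and replace everything that is not a letter
  ;; or whitespace with a space, leaving only plain words.
  (string-downcase
   (map 'string
        (lambda (c)
          (if (or (alpha-char-p c) (eql c #\Space) (eql c #\Newline))
              c
              #\Space))
        text)))

(clean-corpus "It was the best of times, it was the worst of times!")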