As much as I like implementing machine learning algorithms from scratch in the various languages I enjoy, when doing serious research one should not take the risk of writing error-prone code. Most likely somebody has already spent many thousands of hours writing, debugging and optimizing code you can use with some effort. Re-use, people, re-use!
In any case, today I am going to describe how one can use the Weka libraries from ABCL, the Common Lisp implementation that runs on the JVM. Specifically, I am going to use Weka's k-means implementation.
First, I need to dynamically load Weka into the classpath:
(mapc #'require '(:abcl-asdf :cl-ppcre))
(java:add-to-classpath
 (abcl-asdf:as-classpath
  (abcl-asdf:resolve "nz.ac.waikato.cms.weka:weka-stable")))
Weka expects the data it is fed to come in ARFF format, so we are going to need a function that loads an ARFF file.
(defun load-arff-file (file-name)
  (jnew "weka.core.Instances"
        (jnew "java.io.FileReader" file-name)))
For the tests today, I am going to use an ARFF file I made from the wine dataset.
(defvar wine (load-arff-file "weka-wine.arff"))
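In case you need to produce such a file yourself: an ARFF file with only numeric attributes is a short header followed by comma-separated rows. The following is a minimal sketch under that assumption. The function write-arff, the file name toy.arff and the attribute names are hypothetical, and this is not necessarily how weka-wine.arff was made.

```lisp
;; Sketch: write rows of numbers as an ARFF file with numeric attributes only.
;; write-arff is a hypothetical helper, not part of Weka or ABCL.
(defun write-arff (path relation attribute-names rows)
  (with-open-file (out path :direction :output :if-exists :supersede)
    ;; Header: relation name, then one @attribute line per column.
    (format out "@relation ~a~%~%" relation)
    (dolist (name attribute-names)
      (format out "@attribute ~a numeric~%" name))
    ;; Data section: one comma-separated line per row.
    ;; ~f prints floats without the Lisp exponent marker (1.5 instead of 1.5d0).
    (format out "~%@data~%")
    (dolist (row rows)
      (format out "~{~f~^,~}~%" row))))

;; Usage with made-up data:
(write-arff "toy.arff" "toy" '("x" "y") '((1.5d0 2.0d0) (3.0d0 4.25d0)))
```

The resulting file should then be loadable with load-arff-file in the same way as the wine file.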
If you’d like to see the methods the Java class of the data comes with, you should call jss:java-class-method-names on wine:
(jss:java-class-method-names wine)
("notifyAll" "notify" "getClass" "wait" "retainAll" "removeAll"
...
"replaceAttributeAt" "readInstance" "randomize" "numInstances"
"numDistinctValues" "numClasses" "numAttributes")
If you’d like to get the methods’ type signatures, you are better off calling jclass-methods on (jss:find-java-class "weka.core.Instances"). But beware: the raw return vector is not human readable. I find the following function useful:
(defun find-method-with-signature (method-name class-name)
  (delete-if-not (lambda (x) (ppcre:scan method-name (#"toString" x)))
                 (coerce (jclass-methods class-name) 'list)))
For example,
(find-method-with-signature "numInstances" "weka.core.Instances")
(#<method public int weka.core.Instances.numInstances()>)
means that numInstances accepts no arguments, whereas
(find-method-with-signature "deleteAttributeAt" "weka.core.Instances")
(#<method public void weka.core.Instances.deleteAttributeAt(int)>)
shows that deleteAttributeAt takes a single integer argument.
Let’s get back to our data:
(list (#"numInstances" wine) (#"numAttributes" wine))
(178 14)
We have 178 data points, each with 14 attributes.
(#"get" wine 0)
#<weka.core.DenseInstance 14.23,1.71,2.43,15.6,127,2.8,3.0.... {4C58DE9}>
OK, we got back a Weka object. We need to convert it to a Lisp object:
(jss:jarray-to-list (#"toDoubleArray" (#"get" wine 0)))
(14.23d0 1.71d0 2.43d0 15.6d0 127.0d0 2.8d0 3.06d0 0.28d0 2.29d0
5.64d0 1.04d0 3.92d0 1065.0d0 1.0d0)
The last entry is the class label of the wine. We had better delete it before clustering. Attribute indices are zero-based, so the 14th attribute sits at index 13:
(#"deleteAttributeAt" wine 13)
(jss:jarray-to-list (#"toDoubleArray" (#"get" wine 0)))
(14.23d0 1.71d0 2.43d0 15.6d0 127.0d0 2.8d0 3.06d0 0.28d0 2.29d0 5.64d0 1.04d0 3.92d0 1065.0d0)
I am going to need to extract similar information again below, so let us write a function for it:
(defun extract-vectors (instances)
  (let ((num (#"numInstances" instances))
        (res nil))
    (dotimes (i num (nreverse res))
      (push (jss:jarray-to-list (#"toDoubleArray" (#"get" instances i))) res))))
Now, let us create Weka’s k-means clusterer. Note that Java class names are case sensitive:
(defvar k-means (jnew "weka.clusterers.SimpleKMeans"))
(jss:java-class-method-names k-means)
("notifyAll" "notify" "getClass" "hashCode" "equals" "wait"
...
"getDistanceFunction" "distanceFunctionTipText"
"getDontReplaceMissingValues" "setDontReplaceMissingValues"
"dontReplaceMissingValuesTipText" "getDisplayStdDevs")
The methods I am interested in today are setNumClusters and buildClusterer.
(find-method-with-signature "setNumClusters" "weka.clusterers.SimpleKMeans")
(#<method public void weka.clusterers.SimpleKMeans.setNumClusters(int) throws java.lang.Exception>)
(find-method-with-signature "buildClusterer" "weka.clusterers.SimpleKMeans")
(#<method public void weka.clusterers.SimpleKMeans.buildClusterer(weka.core.Instances) throws java.lang.Exception>)
OK. We call these methods as follows:
(#"setNumClusters" k-means 3)
(#"buildClusterer" k-means wine)
For my problem I don’t need the clusters themselves but rather the cluster centroids. So,
(extract-vectors (#"getClusterCentroids" k-means))
((13.732166666666666d0 2.005000000000001d0 2.458d0
17.253333333333334d0 106.88333333333334d0 2.8478333333333334d0
2.9808333333333326d0 0.2886666666666667d0 1.9003333333333334d0
5.492000000000001d0 1.0661666666666667d0 3.1635d0
1113.5333333333333d0 1.0166666666666666d0) (13.151632653061228d0
3.344489795918368d0 2.43469387755102d0 21.43877551020408d0
99.0204081632653d0 1.6781632653061223d0 0.7979591836734694d0
0.45081632653061215d0 1.1630612244897958d0 7.343265285714285d0
0.6859183673469389d0 1.6902040816326522d0 627.5510204081633d0
2.979591836734694d0) (12.257246376811594d0 1.9085507246376812d0
2.2385507246376806d0 20.06376811594203d0 94.04347826086956d0
2.2526086956521745d0 2.0762318840579708d0 0.36231884057971014d0
1.6256521739130436d0 3.0579710144927543d0 1.0557391304347825d0
2.7862318840579707d0 512.8260869565217d0 2.0d0))
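If you do want the clusters themselves, the trained clusterer can be asked which cluster each instance falls into. Here is a minimal sketch, assuming the session above (JSS loaded, k-means already built on wine). It relies on Weka's clusterInstance method; cluster-assignments is a hypothetical helper name.

```lisp
;; Sketch: collect the cluster index assigned to every instance.
;; cluster-assignments is a hypothetical helper, not part of Weka or JSS.
;; Assumes the clusterer has already been trained with buildClusterer.
(defun cluster-assignments (clusterer instances)
  (loop for i below (#"numInstances" instances)
        collect (#"clusterInstance" clusterer (#"get" instances i))))

;; Usage: one cluster index per data point, in the order of the data set.
(cluster-assignments k-means wine)
```

The result is a plain Lisp list of integers between 0 and one less than the number of clusters, so it can be processed further without touching Java.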
The example I gave above is a toy example. I also used Weka’s clusterer on a large dataset (~150K data points, each with 100 numerical attributes) and ran the code to generate 1024 clusters. I ran it on my portable notebook, and it was slow. From what I could see, the library ran on only a single core. My guess is that it is not parallelized.
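One thing that may be worth trying: as far as I know, recent Weka versions give SimpleKMeans an execution-slots setting that defaults to 1, which would be consistent with the single-core behaviour. The sketch below assumes that setNumExecutionSlots is available in the Weka version that was loaded; find-method-with-signature can confirm this. The helper build-k-means and the name big-data are hypothetical, and I have not measured how much it helps.

```lisp
;; Sketch: build a k-means clusterer with more than one execution slot.
;; build-k-means is a hypothetical helper name.
;; Assumption: setNumExecutionSlots exists in the loaded Weka version.
(defun build-k-means (instances num-clusters num-slots)
  (let ((clusterer (jnew "weka.clusterers.SimpleKMeans")))
    (#"setNumClusters" clusterer num-clusters)
    (#"setNumExecutionSlots" clusterer num-slots)
    (#"buildClusterer" clusterer instances)
    clusterer))

;; Usage, where big-data stands for an already loaded weka.core.Instances
;; object and 4 is an arbitrary slot count:
;; (build-k-means big-data 1024 4)
```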