The Kitchen Sink and Other Oddities

Atabey Kaygun

Using JavaPlex with Clojure

Java isn’t great when it comes to heavy numerical computations unless it gets aid from native libraries. But, nevertheless, I am going to show you an example of using the topological data analysis java library JavaPlex developed by the applied topology group at Stanford University. My example code below is written in Clojure.

Topological Data Analysis

Topologists have known a very important fact since the inception of topology back in the beginning of the 20th Century: one can effectively approximate continuous structures, such as manifolds (surfaces and embedded bodies in the 3d space if you are so-inclined), using discrete structures and calculate their continuous invariants (such as dimension, number of holes etc.) using the said discrete approximations. So, why not do the same thing with collection data points?

Here is a mathematician’s way of describing the fundamental problematic of topological data analysis:

Assume we have an embedded manifold \(M\) in \(\mathbb{R}^n\), and a finite sample of points \(X\) taken from from \(M\). Can one calculate topological invariants of \(M\) from \(X\).

The topological invariants we are after are called Betti numbers. For example, the 0-th betti number calculates the number of connected components, 1-st betti number calculates the number of circle-like features, 2-nd betti number calculates the number of sphere-like features, and so on…

No Free Lunch

In theory, topology, unlike geometry, is insensitive to small changes to the underlying metric (distance function). However, how these finite cloud of points are chosen (sampling) does effect the results. This means, if sampled carefully (dare I say correctly), calculations would give almost correct answers. So far so good.

But, unfortunately, there is no such a thing as a free lunch. In order to calculate the betti numbers we need to construct a simplicial complex and then construct its singular complex. The size of the simplicial complex on \(n\) points (in the worst case scenario) is \(n!\approx n^n\) (worse than exponential!) which makes calculations tricky to say the least.

Barcodes

From our cloud of points we first construct a gradually growing network of points depending on their proximity. One key point here is that even though we need a notion of proximity of points in constructing the network, the topological invariants we are going to calculate at the end, are independent of any choices we make in the process.

In the process of enlarging the network, some topological features might appear (new polygonal features) and some might disappear (some polygons features might fill-up and disappear). These features are encoded with things called homology classes, and betti numbers count how many homology classes of a certain degree there are.

Barcodes are records of homology classes appearing (their birth) and disappearing (their death) as a pair of time-stamps.

img

[Source]

If you’d like to learn more, I would recommend the lectures by Frédéric Chazal and Julien Tierny at INRIA.

JavaPlex and Clojure

There is an excellent tutorial on how to use JavaPlex, you should read it. What I do down here is somewhat different, but you can adjust the examples there for the code below. Biggest difference is that I did write my own sampling code.

Let’s do some coding! Unfortunately, the jar files of the JavaPlex library are not on Maven. So, you would have to download it. You have two options:

  1. You can declare it a local resource in your project.clj file,
  2. You can add it to your local maven repository.

I went with the second option.

Let me start with the namespace stuff:

(ns tda.core
  (:import edu.stanford.math.plex4.api.Plex4)
  (:import java.util.Random)
  (:gen-class))

In order to test the code, I need sample points from a nice topological object. In my case this is the n-sphere

(defn sample-from-Sn [dim n r epsilon]
  (let [rng (Random.)
        rs (take n (repeatedly (fn [] (+ r (* epsilon (.nextGaussian rng))))))]
    (loop [m dim
           points (take n (repeatedly (fn [] [(rand-nth [-1 1])])))]
        (if (zero? m)
          (map (fn [xs r] (map (fn [x] (* x r)) xs)) points rs)
          (let [thetas (take n (repeatedly (fn [] (* Math/PI (.nextDouble rng)))))]
            (recur (dec m)
                   (map (fn [point theta]
                          (conj (map (fn [x] (* x (Math/cos theta))) point)
                                (Math/sin theta)))
                        points thetas)))))))
#'tda.core/sample-from-Sn

Parameter dim controls the dimension of the sphere, n tells the size of the sample, r controls the radius of the sphere and epsilon controls the variance of the error I added to the radius. I must warn: my samples are not uniformly distributed. If you plot this for the 2-dimensional sphere, you will see accumulation on the poles. But, it will do here.

Okay, let me sample some points from \(S^2\) embedded in \(\mathbb{R}^3\):

(def S2-clojure (sample-from-Sn 2 60 1.0 0.05))
(into [] S2-clojure)
#'tda.core/S2-clojure
[(0.9694257845977597 0.2565225846251115 0.149693530313045)
(0.3160506015340551 -0.08420911142786046 -0.9730145781728613) 
(0.2441359514617179 -0.6387846652366139 -0.7411718962603825) 
(0.17159371543720592 -0.9472605873221599 -0.5206384637048138)
(0.8191247892843844 -0.5051582258691645 -0.15355993023272338) 
(0.08761606749846149 -1.0593979203996742 -0.18669956905276947) 
(1.0351699217914776 -0.16973081443032012 -0.05548596986325099) 
(0.6321192170240967 -0.4463243263048794 -0.6331108910818303) 
(0.44110576960908393 0.7901333222386744 -0.17699391506183906) 
(0.6664362117540702 -5.182362640554818E-4 -0.7878044433470869) 
(0.23099494868770934 1.0035143441460654 0.15219310831575988) 
(0.3793772572961613 0.9211308061484103 0.3368188034751347)
(0.9391765946251789 -0.5149998723423344 0.00951410804395951) 
(0.5850653567514654 -0.1440059217893346 0.7914547828690756)
(0.3508641770888623 -0.8216806863532357 0.3210694225742778) 
(0.365271058828488 0.467372201564975 -0.7477195994475816) 
(0.9998784904968095 -0.2504306059505103 0.17057029208945632) 
(0.37153553168598225 -0.8371234147021953 -0.4039289385911284) 
(0.1880595946327987 0.872119289752966 -0.25457609711238316) 
(0.9078318321103805 0.11443429289075147 0.30985628091744377) 
(0.2753178550631587 0.5522157044155734 0.724835568584973) 
(0.7276416387195462 0.43091836672710043 -0.5955978441028754) 
(0.928458266429133 0.34186663945326023 0.05966688766915473) 
(0.9483252756222065 0.03997832071469546 -0.03415491558412182) 
(1.039819688540467 -0.0029375076770020523 -0.05596330134040296) 
(0.018988746613726758 -0.8139128755155158 0.5399992866645357) 
(0.75171740780237 -0.38526794255364827 0.17028688698667463) 
(0.7046546908705352 0.45716437242694546 -0.38071293516071086) 
(0.9672486476322995 0.04443394765956402 0.13088126355371188) 
(0.814551798395074 -0.4884786291897059 -0.41141139951264294) 
(0.8059618509573785 0.2578346913295543 0.4784906568264699)
(0.5851445796338018 0.17210563813682658 -0.729820796773424) 
(1.022865088634098 -0.06132159806564196 -0.22144052975566123) 
(0.8019568682509614 0.058094639441568514 -0.6154093307097448) 
(0.8749755458384921 0.1779446617970637 -0.5069180924762302) 
(0.7561408596423005 0.6209311431238985 0.3056639783082412)
(0.020146319348906944 -0.620716247392654 0.8449959844580252) 
(0.7308601008956529 0.08219526175260423 -0.790211858586775) 
(0.43137452712078234 -0.37506259931597696 -0.8054673920079827) 
(0.8410469276246595 -0.4552426709132372 -0.014705503002436397) 
(1.0196327878770284 0.017668451272742396 0.13452291759828935) 
(0.7881072022617578 -0.40181778621391795 -0.23440165612318994) 
(0.9407412796388347 -0.015940593773706064 0.030361770156930493) 
(0.958641455522085 -0.04760662629066117 0.037604490626258216)
(0.7975644193823939 0.07790949438593872 -0.4407502726054776) 
(0.8723001721910031 -0.18197209769171202 0.3734974136910884) 
(1.0075286251968685 -0.07024816367980981 0.0012243200449112126) 
(0.8914068849160154 -0.16636570799712203 0.09751276054119445) 
(0.7248253421185545 -0.7441517660417161 -0.009518393915065062) 
(0.46741625025217876 0.5982647783111011 -0.4012345471412837) 
(0.9395881922522764 -0.28054294602773683 -0.13171366106619273) 
(0.6550895775349393 -0.6990611847238194 0.467956330584316)
(0.48736324839252215 0.09402215688413616 -0.9338551699907922)
(0.9589695669528585 0.16568955031112448 -0.09542644569897873) 
(0.8726612994318831 -0.010564243542695102 0.6340352123016327) 
(1.0292743257857333 -0.18881987244771503 0.11154210093830234) 
(0.7993825384808707 0.3512141469954968 -0.5124549756863521) 
(0.47595730072279946 0.9016441364280734 -0.18816975248168527) 
(0.6220383817799092 -0.29058614371905445 0.6718158125861152) 
(0.5954384246793762 -0.4642050574372491 -0.6583431440198491)]

Since this is a java library, we need to play nice and convert the data point cloud into a java array:

(defn into-java [xs]
   (into-array (map double-array xs)))

(def S2 (into-java S2-clojure))
#'tda.core/into-java
#'tda.core/S2

Next, we need to construct a simplicial complex. JavaPlex has a built-in functions for that. Below, I chose to construct a Rips complex which is computationally cheaper to other alternatives such a Čech complex.

(def rips (Plex4/createVietorisRipsStream S2 3 2.25 100))
#'tda.core/rips

The second parameter controls the maximum dimension of the simplicial complex, and the last parameter controls the upper bound for the proximity parameters in the filtration.

Next, we have a java-ism: we need to instantiate a class for the algorithm that calculates the barcodes.

(def algo (Plex4/getRationalSimplicialAlgorithm 3))
#'tda.core/algo

The parameter we pass to the function tells us the upper bound for the dimension it needs to check.

Now, moment of truth. Calculate the barcodes. Recall that these are collections of time-stamps of births and deaths of homology classes along with the dimension of the homology class.

(.computeIntervals algo rips)
Dimension: 0
[0.0, 0.0225)
[0.0, 0.045)
[0.0, 0.045)
[0.0, 0.045)
[0.0, 0.0675)
[0.0, 0.09)
[0.0, 0.09)
[0.0, 0.09)
[0.0, 0.11249999999999999)
[0.0, 0.11249999999999999)
[0.0, 0.11249999999999999)
[0.0, 0.11249999999999999)
[0.0, 0.11249999999999999)
[0.0, 0.11249999999999999)
[0.0, 0.135)
[0.0, 0.135)
[0.0, 0.135)
[0.0, 0.135)
[0.0, 0.135)
[0.0, 0.135)
[0.0, 0.1575)
[0.0, 0.1575)
[0.0, 0.1575)
[0.0, 0.18)
[0.0, 0.18)
[0.0, 0.18)
[0.0, 0.18)
[0.0, 0.18)
[0.0, 0.18)
[0.0, 0.18)
[0.0, 0.20249999999999999)
[0.0, 0.20249999999999999)
[0.0, 0.20249999999999999)
[0.0, 0.22499999999999998)
[0.0, 0.22499999999999998)
[0.0, 0.22499999999999998)
[0.0, 0.2475)
[0.0, 0.2475)
[0.0, 0.2475)
[0.0, 0.2475)
[0.0, 0.27)
[0.0, 0.27)
[0.0, 0.27)
[0.0, 0.2925)
[0.0, 0.2925)
[0.0, 0.2925)
[0.0, 0.315)
[0.0, 0.33749999999999997)
[0.0, 0.33749999999999997)
[0.0, 0.33749999999999997)
[0.0, 0.36)
[0.0, 0.36)
[0.0, 0.36)
[0.0, 0.3825)
[0.0, 0.3825)
[0.0, 0.40499999999999997)
[0.0, 0.4275)
[0.0, 0.4275)
[0.0, 0.54)
[0.0, infinity)
Dimension: 1
[0.09, 0.11249999999999999)
[0.22499999999999998, 0.2475)
[0.2475, 0.2925)
[0.2925, 0.33749999999999997)
[0.315, 0.33749999999999997)
[0.315, 0.33749999999999997)
[0.40499999999999997, 0.4275)
[0.44999999999999996, 0.5175)
[0.33749999999999997, 0.5625)
[0.5175, 0.5625)
[0.4725, 0.6975)
[0.6074999999999999, 0.72)

Analysis

As the proximity parameter grows, eventually we get one large blob. So, there is always a 0-class with infinite life span. The life spans of higher classes are always finite. Thus one has to pay attention to how long these classes survive, and make a judgement call as to which of those are real-features and which are ghost-artifacts.

The topological space 2-sphere has a very specific homology: one class at degree 0, and one class at degree 2. The calculation above, depending on the sampling, most often cannot capture the top degree class. Bummer! If I increase the number of points, my laptop becomes a lap-heating-device without much gain. Yet another bummer!