The Kitchen Sink and Other Oddities

Atabey Kaygun

Funniest and Unfunniest Jokes in the Jester Dataset

Description of the Problem

The Jester Dataset 1 consists of ratings of 100 specific jokes collected from 70,000 anonymized users between 1999 and 2003. Today, I am going to use SVD to decide which jokes these users deemed the funniest and which the unfunniest.

Algorithm

I will form a large matrix whose rows are labelled by individual users and whose columns are labelled by the jokes. The matrix is initialized to 0. The entry at a given row and column is the rating that user gave to that joke. I am not going to apply a tf-idf correction because I will assume that a higher number of ratings for a specific joke indicates an emotional response. Next, we separate the funny and unfunny categories: since users rate jokes from -10 (very unfunny) to 10 (very funny), for the “funny” category we filter out the negative ratings and for the “unfunny” category we filter out the positive ratings. Finally, we apply SVD to each matrix, and the entries of the first right singular vector will tell us the funniest and unfunniest jokes.
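The last step can be tried out without any library. The following is a minimal sketch, assuming a tiny made-up ratings matrix, that approximates the first right singular vector by power iteration on the product of the transposed matrix with the matrix itself, rather than by a full SVD. The names `ratings` and `leading-right-singular-vector` and the iteration count 50 are illustrative assumptions and are not part of the implementation in the next section.

```clojure
;; Toy ratings: rows are users, columns are jokes, 0 means "no positive rating".
;; These numbers are invented for illustration only.
(def ratings
  [[5.0 0.0 1.0]
   [4.0 1.0 0.0]
   [5.0 0.0 2.0]])

(defn transpose [m]
  (apply mapv vector m))

(defn mat-vec
  "Multiplies the matrix m (a vector of rows) with the vector v."
  [m v]
  (mapv #(reduce + (map * % v)) m))

(defn unit
  "Rescales v to Euclidean length 1."
  [v]
  (let [len (Math/sqrt (reduce + (map * v v)))]
    (mapv #(/ % len) v)))

(defn leading-right-singular-vector
  "Power iteration on A^T A; a stand-in for a full SVD routine."
  [a iterations]
  (let [at    (transpose a)
        step  (fn [v] (unit (mat-vec at (mat-vec a v))))
        start (unit (vec (repeat (count (first a)) 1.0)))]
    (nth (iterate step start) iterations)))

(def weights (leading-right-singular-vector ratings 50))

;; Joke indices ordered from the highest weight to the lowest.
(def ranking (map second (sort-by first > (map vector weights (range)))))

(println ranking)
```

On this toy matrix the first joke has by far the largest ratings, so `ranking` starts with index 0.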

Implementation

It’s been a while since I used Clojure. I will give today’s implementation in Clojure.

If you follow this blog, and my earlier Clojure posts, you know that I am a big fan of boot. It does what I ask, but most people I know use leiningen for their needs.

I will write the code as a boot script. Here is what I need:

#!/usr/bin/env boot

(set-env! :dependencies '[[clatrix "0.5.0"]])

(require
   '[clatrix.core :as cl]
   '[clojure.java.io :as io]
   '[clojure.string :as st])

For the singular value decomposition part of the code, I am going to use clatrix, which is why it appears in the dependencies.

(defn fn-neg [x]
   (if (< x 0) (- x) 0))

(defn fn-pos [x]
   (if (and (>= x 0) (< x 11)) x 0))

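I am going to need these two functions to separate the negative (unfunny) and the positive (funny) ratings. Note that fn-neg returns the magnitude of a negative rating, so both matrices have only non-negative entries. The rating value 99 is reserved for unrated jokes. Both functions replace these with 0.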
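To see what the two filters do, here is a small sketch on a single row of ratings. The row `sample-row` is made up for illustration and is not taken from the dataset; the two functions are repeated only so that the snippet runs on its own.

```clojure
;; Copies of the two filters from above.
(defn fn-neg [x]
  (if (< x 0) (- x) 0))

(defn fn-pos [x]
  (if (and (>= x 0) (< x 11)) x 0))

;; A hypothetical row: two liked jokes, two disliked jokes, one unrated (99).
(def sample-row [7.5 -3.25 99 0.5 -9.0])

(def funny-row   (map fn-pos sample-row))
(def unfunny-row (map fn-neg sample-row))

(println funny-row)    ; prints (7.5 0 0 0.5 0)
(println unfunny-row)  ; prints (0 3.25 0 0 9.0)
```

The unrated entry 99 becomes 0 in both rows, so an unrated joke contributes nothing to either matrix.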

The entries of the vectors coming from SVD need to be normalized in some way. I am going to use the following function to do that:

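(defn normalize [xs]
  ;; The plain sum equals the L1 norm up to sign when all entries share a
  ;; sign, so dividing by it also removes the arbitrary overall sign of a
  ;; singular vector. The result has mean 1.
  (let [l1-norm (reduce + xs)
        length (count xs)]
     (map #(* % length (/ l1-norm)) xs)))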
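A quick check of what this normalization does, as a sketch on a made-up vector (the numbers are not from the dataset). A singular vector is only determined up to sign, and the example suggests why that does not matter here, as long as all entries share the same sign.

```clojure
;; Copy of normalize from above, repeated so that the snippet runs on its own.
(defn normalize [xs]
  (let [l1-norm (reduce + xs)
        length  (count xs)]
    (map #(* % length (/ l1-norm)) xs)))

;; A hypothetical vector whose entries are all negative.
(def scores (normalize [-1.0 -2.0 -5.0]))

;; The same vector with the opposite sign gives the same scores.
(def flipped (normalize [1.0 2.0 5.0]))

(println scores)                ; prints (0.375 0.75 1.875)
(println (= scores flipped))    ; prints true
(println (/ (reduce + scores) (count scores)))  ; the mean, prints 1.0
```

A score above 1 then means a joke carries more than the average weight.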

Now, here is the ranking function:

(defn rank [xs]
   ;; take the first right singular vector and normalize its entries
   (let [res (->> xs cl/matrix cl/svd)]
      (-> res :right cl/cols first normalize)))

At the end, we would like to display the text of the funniest and unfunniest jokes. You can download these from the jester project page. I unzipped them into the directory “data/jokes”. The following function fetches the text of the joke with a given index.

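(defn get-joke [n]
   ;; joke files are numbered from 1, column indices from 0;
   ;; the path follows the "data/jokes" directory mentioned above
   (-> (slurp (str "data/jokes/init" (+ n 1) ".html"))
       (st/split #"<!--begin of joke -->\n|<!--end of joke -->\n")
       second))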
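The splitting step can be tried without the downloaded files. Here is a minimal sketch on a made-up page: the contents of `sample-page` are my assumption about the layout, based only on the two comment markers that the regular expression looks for.

```clojure
(require '[clojure.string :as st])

;; Hypothetical page contents; only the two markers matter here.
(def sample-page
  (str "<html><body>\n"
       "<!--begin of joke -->\n"
       "A placeholder joke.\n"
       "<!--end of joke -->\n"
       "</body></html>\n"))

;; Splitting on either marker yields three pieces; the joke is the middle one.
(def joke-text
  (-> sample-page
      (st/split #"<!--begin of joke -->\n|<!--end of joke -->\n")
      second))

(print joke-text)
```

Everything between the markers is returned as is, which is why HTML tags in the joke text survive into the output.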

Now, the main function:

(defn -main []
   (let [data (->> "data/jester-data-1.csv" io/reader line-seq
                   ;; drop the first column: the number of jokes the user rated
                   (map (fn [x] (rest (map read-string (st/split x #","))))))
         funny (rank (map #(map fn-pos %) data))
         ;; pair each score with its joke index, highest score first
         results (sort-by first > (map vector funny (range)))]
      (doseq [x (take 3 results)]
         (println "----- " (first x) " ------------------------------\n")
         (println (get-joke (second x)) "\n\n\n"))))

This will display the top 3 funniest jokes. If we replace fn-pos with fn-neg we will get the top 3 unfunniest jokes. And here is the output:

-----  1.938512254843581  ------------------------------

A guy goes into confession and says to the priest, "Father, I'm 80 years
old, widower, with 11 grandchildren. Last night I met two beautiful flight
attendants. They took me home and I made love to both of them. Twice."
<P>
The priest said: "Well, my son, when was the last time you were in
confession?"
<p> "Never Father, I'm Jewish."
<p> "So then, why are you telling me?"
<p> "I'm telling everybody."




-----  1.8564830954244231  ------------------------------

Clinton returns from a vacation in Arkansas and walks down  the
steps of Air Force One with two pigs under his arms.  At the bottom
of the steps, he says  to the honor guardsman, "These are genuine
Arkansas Razor-Back Hogs.  I got this one for Chelsea and this one for
Hillary."  <P>

The guardsman replies, "Nice trade, Sir."




-----  1.8334775686052165  ------------------------------

A guy walks into a bar, orders a beer and says to the bartender,
"Hey, I got this great Polish Joke..."
<P>
The barkeep glares at him and says in a warning tone of voice:
"Before you go telling that joke you better know that I'm Polish, both
bouncers are Polish and so are most of my customers"
<P>
"Okay" says the customer,"I'll tell it very slowly."

If, instead of filtering out the unfunny ratings, we measure them and remove the funny ratings, the code becomes

(defn -main []
   (let [data (->> "data/jester-data-1.csv" io/reader line-seq
                   ;; drop the first column: the number of jokes the user rated
                   (map (fn [x] (rest (map read-string (st/split x #","))))))
         unfunny (rank (map #(map fn-neg %) data))
         ;; pair each score with its joke index, highest score first
         results (sort-by first > (map vector unfunny (range)))]
      (doseq [x (take 3 results)]
         (println "----- " (first x) " ------------------------------\n")
         (println (get-joke (second x)) "\n\n\n"))))

The output from this version is:

-----  2.310577261546736  ------------------------------

Q. What is orange and sounds like a parrot?  <BR><BR>

A. A carrot.



-----  2.074095832155103  ------------------------------

How many teddybears does it take to change a lightbulb?
<p>
It takes only one teddybear, but it takes a whole lot of lightbulbs.



-----  1.8763955400682024  ------------------------------

They asked the Japanese visitor if they have elections in his
country.  <BR><BR>
"Every Morning" he answers.

One can also go for the funniest of the unfunny jokes and the unfunniest of the funny jokes by playing with the sorting function and by choosing whether to filter out the positive or the negative ratings.
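Playing with the sorting function amounts to swapping the comparator. Here is a minimal sketch, assuming a hypothetical vector of normalized scores; the numbers and the helper name `ranked` are made up for illustration and do not come from the dataset.

```clojure
;; Hypothetical normalized scores, one per joke index.
(def scores [0.4 1.9 1.1 0.6])

(defn ranked
  "Pairs each score with its joke index and sorts the pairs by score
   using the comparator cmp."
  [cmp xs]
  (sort-by first cmp (map vector xs (range))))

(def top-two    (take 2 (ranked > scores)))  ; strongest response first
(def bottom-two (take 2 (ranked < scores)))  ; weakest response first

(println top-two)     ; prints ([1.9 1] [1.1 2])
(println bottom-two)  ; prints ([0.4 0] [0.6 3])
```

If the scores come from the fn-neg matrix, the ascending order lists the least disliked jokes first; if they come from the fn-pos matrix, it lists the least liked ones.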