Entropy and approximately one-to-one maps

Description of the problem

I thought of this problem as I enter student grades to a spreadsheet. Here it goes:

The naive description

I have a list of students and I want to search for each student using the search function as I enter their grades. I want to be able to find the student for whom I am going to enter the grade with fewest possible keystrokes. What should I do?

The formal description

I have a dataset $x^{(1)}, \dots, x^{(N)}$ each of which is a word $x^{(i)} = (x_{1}^{(i)} \dots x_{m}^{(i)})$ of length $m$ where $m$ is the same for every data point, and each letter $x_{j}^{(i)}$ comes from a fixed alphabet $A$ . How do I find an interval $j, j + 1, \dots, j + ℓ$ in ${1, \dots, m}$ such that $f (x^{(i)}) = (x_{j}^{(i)} x_{j + 1}^{(i)} \dots x_{j + ℓ}^{(i)})$ is one-to-one, or close to being one-to-one.

Here I choose to ask for an interval as the search function will require that I enter letter/digits in sequence. The more general question may require that I ask for a subsequence of indices $j_{1}, \dots, j_{ℓ}$ from ${1, \dots, m}$ such that $f (x^{(i)}) = (x_{j_{1}}^{(i)} x_{j_{2}}^{(i)} \dots x_{j_{ℓ}}^{(i)})$ is one-to-one, or close to being one-to-one.

The entropy

Entering the first few digits of the student id number is usually a bad idea as the beginning of these numbers designate their admission years and departments. This means you will have to enter so many digits for the admission year, and so many years for the department etc. which is not very random. I need the sequence I enter to separate the students. We need a part of their numbers with the highest randomness, a.k.a. entropy.

Imagine the hypothetical case of 100 students from two departments taking my class: 20 from Computer Science and 80 from Electrical Engineering. Assuming the digits I enter are the department codes, I will get two groups: CS and EE. The entropy (the measure of randomness) is $- 0.2 \log_{2} (0.2) - 0.8 \log_{2} (0.8) \approx .722$ Here I have two groups: CS students (in the group their total measure is 20%) and EE students (they comprise of 80% of the group). On the other hand if I found the right interval of digits to enter I would have gotten each individual student (all 100 of them) in its own separate group. In that case the randomness measure would have been $\sum_{i = 1}^{100} - 0.01 \log_{2} (0.01) = - \log_{2} (0.01) = \log_{2} (100) \approx 6.644$

This means, I will have to find the interval $j, j + 1, \dots, j + ℓ$ such that the entropy of the probability distribution on the projected set $π (x^{(i)}) = (x_{j}^{(i)} \dots x_{j + ℓ}^{(i)})$ is close to $- N \cdot \frac{1}{N} \log_{2} (1 / N) = \log_{2} (N)$

What is the entropy of the probability distribution on the projected set? you might ask. Imagine you have a finite set $X$ and a labeling function $f : X \to A = {a_{1}, \dots, a_{t}}$ . Let $n_{i}$ be the number of elements in $X$ labeled by $a_{i} \in {a_{1}, \dots, a_{t}}$ via $f$ . In other words, $n_{i} = ∥ {x \in X ∥ f (x) = a_{i}} ∥$ Then the probability of seeing the label $a_{i}$ on the dataset is $n_{i} / N$ where $N$ is the number of elements in $X$ . Hence, the entropy is $\sum_{i = 1}^{t} - \frac{n_{i}}{N} \log_{2} (n_{i} / N) = \log_{2} (N) - \frac{1}{N} \sum_{i = 1}^{t} n_{i} \log_{2} (n_{i})$ In case $A$ is our alphabet and $f$ is projection on $i$ -th letter then we would like this value to be as close to $\log_{2} (N)$ as possible. This means we could instead minimize the sum $\sum_{i = 1}^{t} n_{i} \log_{2} (n_{i})$

Implementation and test

I will assume that the dataset is given as a lisp list of strings of length 7 consisting of decimal digits only. My dataset consists of 266 data points. I will load it from a file:

(defparameter *data* 
   (with-open-file (infil "hash-test.txt" :direction :input)
      (let (res)
         (do ((line (read-line infil nil) (read-line infil nil)))
             ((null line) res)
           (push line res)))))

*DATA*

The objective function will take the dataset as a parameter together with a function fun. It will construct a hash table whose keys are the data points evaluated under fun and will count how many times a specific value appears in the image.

(defun objective (data fun)
   (let ((local (make-hash-table :test 'equal)))
      (dolist (x data)
         (if (gethash (funcall fun x) local)
           (incf (gethash (funcall fun x) local))
           (setf (gethash (funcall fun x) local) 1)))
      (let ((res 0d0)) 
         (maphash (lambda (x y) (incf res (* y (log y 2.0d0)))) local)
         res)))

OBJECTIVE

Let me test this on our data on various projections:

(loop for i from 0 to 6 
      collect (format nil "~5,1f" (objective *data* (lambda (x) (char x i)))))

(2055.4 1634.7 1418.1 1268.9 1264.9 1261.7 1363.3)

The results indicate that I should use 4th, 5th and 6th digits to get the best results. Let me see if the result is better if I calculate the entropy of the projection function which uses these three digits together.

(loop for i from 0 to 4 collect
    (format nil "~5,1f" (objective *data* (lambda (x) (subseq x i (+ i 3))))))

(1141.6 460.7 129.6  55.5  87.5)

This confirms our guess: If I were to use the subsequence (subseq x 3 6) I will get the best result. Let us calculate the actual entropy, and see if it is close to the maximal entropy 8.1 for the data set we have:

(let* ((N (length *data*))
       (result (- (log n 2.0d0) 
                  (* (/ 1.0d0 n) 
                     (objective *data* (lambda (x) (subseq x 3 6)))))))
   (format nil "~2,1f" result))

7.8

Analysis

The result above indicates, I should type the 4th, 5th and 6th digits of the students’ id numbers. However, there is a danger of these numbers colliding with subsequences starting at other positions. For example, I can type “456” to match “1234567” and yet match also “4567890”. In order to prevent such collisions, I better create a primary key for each student using the 4th through 6th digits of the student numbers together with, say the first letter of their last names. This is a key I can easily type into the search box of my spreadsheet software as I glance at each student’s exam paper.