Entropy and approximately one-to-one maps
Description of the problem
I thought of this problem while entering student grades into a
spreadsheet. Here it goes:
The naive description
I have a list of students, and I want to look each student up with the
search function as I enter their grades. I want to find the student
whose grade I am about to enter with the fewest possible keystrokes.
What should I do?
I have a dataset $D$, each element of which is a word
$w = w_1 w_2 \cdots w_n$ of length $n$, where $n$ is the same for every
data point, and each letter $w_i$ comes from a fixed alphabet $A$. How do
I find an interval $I$ in $\{1, \ldots, n\}$ such that the projection
$\pi_I \colon D \to A^{|I|}$ is one-to-one, or close to being one-to-one?
Here I choose to ask for an interval because the search function
requires that I enter letters/digits in sequence. The more general
question would ask for a subsequence of indices $J$ from
$\{1, \ldots, n\}$ such that the projection $\pi_J$ is one-to-one, or
close to being one-to-one.
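As a rough illustration of what "close to one-to-one" means in practice, here is a minimal sketch, assuming the dataset is a list of equal-length strings. The helper `count-collisions` is hypothetical (it is not part of the code used later in this post), and the toy id numbers are made up. A result of 0 means the projection onto the chosen interval is one-to-one on the dataset.

```lisp
;; Hypothetical helper: number of data points whose image under the
;; projection onto positions [start, end) was already seen earlier.
(defun count-collisions (data start end)
  (let ((seen (make-hash-table :test 'equal))
        (collisions 0))
    (dolist (w data collisions)
      (let ((key (subseq w start end)))
        (if (gethash key seen)
            (incf collisions)
            (setf (gethash key seen) t))))))

;; Made-up toy data, for illustration only.
(count-collisions '("2013001" "2013002" "2014001") 0 4) ; => 1
(count-collisions '("2013001" "2013002" "2014001") 4 7) ; => 1
(count-collisions '("2013001" "2013002" "2014001") 3 7) ; => 0
```

Counting collisions only says whether a projection fails to be one-to-one; it does not say how evenly the students are spread over the groups, which is what entropy measures.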
The entropy
Entering the first few digits of the student id number is usually a
bad idea, as the beginning of these numbers encodes the admission year
and the department. This means I would have to enter so many digits for
the admission year, so many digits for the department, and so on, and
these digits are not very random. I need the sequence I enter to
separate the students, so I need the part of their numbers with the
highest randomness, a.k.a. entropy.
Imagine the hypothetical case of 100 students from two departments
taking my class: 20 from Computer Science and 80 from Electrical
Engineering. Assuming the digits I enter are the department codes, I
will get two groups: CS and EE. The entropy (the measure of randomness)
is
$$ -0.2 \log_2(0.2) - 0.8 \log_2(0.8) \approx 0.72 $$
Here I have two groups: CS students (they make up 20% of the class) and
EE students (they make up 80% of the class). On the other hand, if I
found the right interval of digits to enter, I would get each individual
student (all 100 of them) in a separate group. In that case the
randomness measure would be
$$ -\sum_{i=1}^{100} \frac{1}{100} \log_2\left(\frac{1}{100}\right) = \log_2(100) \approx 6.64 $$
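The two-department example can be checked numerically. The sketch below assumes base-2 logarithms (as in the code later in this post) and strictly positive group sizes; `entropy-of-counts` is a hypothetical helper introduced only for this check.

```lisp
;; Hypothetical helper: Shannon entropy (in bits) of a list of group sizes.
;; Assumes every count is a positive integer.
(defun entropy-of-counts (counts)
  (let ((total (reduce #'+ counts)))
    (- (reduce #'+
               (mapcar (lambda (c)
                         (let ((p (/ c total 1.0d0)))
                           (* p (log p 2.0d0))))
                       counts)))))

;; Two departments with 20 and 80 students: roughly 0.72 bits.
(entropy-of-counts '(20 80))

;; 100 students, each in a group of their own: roughly 6.64 bits.
(entropy-of-counts (make-list 100 :initial-element 1))
```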
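This means I have to find the interval $I$ such that the entropy of the
probability distribution on the projected set is close to $\log_2(N)$,
where $N$ is the number of data points.
What is the entropy of the probability distribution on the projected
set? you might ask. Imagine you have a finite set $X$ and a labeling
function $f \colon X \to L$. Let $N_\ell$ be the number of elements in
$X$ labeled by $\ell$ via $f$. In other words,
$$ N_\ell = |f^{-1}(\ell)| $$
Then the probability of seeing the label $\ell$ on the dataset is
$$ p_\ell = \frac{N_\ell}{N} $$
where $N$ is the number of elements in $X$. Hence, the entropy is
$$ H(f) = -\sum_{\ell \in f(X)} \frac{N_\ell}{N} \log_2\left(\frac{N_\ell}{N}\right) $$
In case $L$ is our alphabet and $f$ is the projection on the $i$-th
letter, we would like this value to be as close to $\log_2(N)$ as
possible. This means we could instead minimize the sum
$$ \sum_{\ell \in f(X)} N_\ell \log_2(N_\ell) $$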
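The step from maximizing the entropy to minimizing a sum can be spelled out as follows. Here $N_\ell$ denotes the number of data points carrying the label $\ell$, $N$ is the size of the dataset, and $p_\ell = N_\ell / N$; the only fact used is that the counts $N_\ell$ add up to $N$.

```latex
\begin{aligned}
H &= -\sum_{\ell} p_\ell \log_2 p_\ell
   = -\sum_{\ell} \frac{N_\ell}{N} \left( \log_2 N_\ell - \log_2 N \right) \\
  &= \log_2 N - \frac{1}{N} \sum_{\ell} N_\ell \log_2 N_\ell
\end{aligned}
```

Since $\log_2 N$ and $1/N$ are fixed by the dataset, the entropy is largest exactly when the sum $\sum_\ell N_\ell \log_2 N_\ell$ is smallest. That sum is zero precisely when every $N_\ell$ equals 1, which is the one-to-one case.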
Implementation and test
I will assume that the dataset is given as a lisp list of strings of
length 7 consisting of decimal digits only. My dataset consists of 266
data points. I will load it from a file:
(defparameter *data*
  (with-open-file (infil "hash-test.txt" :direction :input)
    (let (res)
      (do ((line (read-line infil nil) (read-line infil nil)))
          ((null line) res)
        (push line res)))))
*DATA*
The objective function will take the dataset as a parameter together
with a function fun. It will construct a hash table whose keys are the
data points evaluated under fun and will count how many times a specific
value appears in the image. It then returns the sum of $c \log_2(c)$
over these counts $c$, which is the quantity we want to minimize.
(defun objective (data fun)
  ;; Count how many data points map to each value of FUN.
  (let ((local (make-hash-table :test 'equal))
        (res 0d0))
    (dolist (x data)
      (incf (gethash (funcall fun x) local 0)))
    ;; Sum count * log2(count) over all the values in the image.
    (maphash (lambda (key count)
               (declare (ignore key))
               (incf res (* count (log count 2.0d0))))
             local)
    res))
OBJECTIVE
Let me test this on our data on various projections:
(loop for i from 0 to 6
      collect (format nil "~5,1f" (objective *data* (lambda (x) (char x i)))))
(2055.4 1634.7 1418.1 1268.9 1264.9 1261.7 1363.3)
The results indicate that I should use the 4th, 5th and 6th digits
(indices 3, 4 and 5), which have the smallest values, to get the best
results. Let me see if the result is better if I evaluate the objective
on the projection that uses three consecutive digits together.
(loop for i from 0 to 4
      collect (format nil "~5,1f" (objective *data* (lambda (x) (subseq x i (+ i 3))))))
(1141.6 460.7 129.6 55.5 87.5)
This confirms our guess: if I use the subsequence (subseq x 3 6) I get
the best result. Let us calculate the actual entropy and see if it is
close to the maximal entropy $\log_2(266) \approx 8.1$ for the data set
we have:
(let* ((n (length *data*))
       (result (- (log n 2.0d0)
                  (* (/ 1.0d0 n)
                     (objective *data* (lambda (x) (subseq x 3 6)))))))
  (format nil "~2,1f" result))
7.8
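The two scans above fix the window length by hand. As a sketch of how the search for the best starting position could be automated for any fixed window length, the hypothetical helpers `window-score` and `best-start` below recompute the same sum of $c \log_2(c)$ on a window and keep the first position with the smallest score. The names and the toy data are assumptions made for illustration. The window length has to be fixed in advance, because widening a window can never increase the score, so the full word always scores best.

```lisp
;; Hypothetical helper: sum of c*log2(c) over the counts c of each distinct
;; value of (subseq w start end) in DATA.
(defun window-score (data start end)
  (let ((counts (make-hash-table :test 'equal))
        (score 0d0))
    (dolist (w data)
      (incf (gethash (subseq w start end) counts 0)))
    (maphash (lambda (key c)
               (declare (ignore key))
               (incf score (* c (log c 2.0d0))))
             counts)
    score))

;; Hypothetical helper: start index and score of the best window of length K.
;; Assumes DATA is non-empty and all words share the same length >= K.
(defun best-start (data k)
  (let ((n (length (first data)))
        (best nil)
        (best-score nil))
    (loop for i from 0 to (- n k)
          for s = (window-score data i (+ i k))
          when (or (null best-score) (< s best-score))
            do (setf best i best-score s))
    (values best best-score)))

;; Made-up toy data: the prefixes repeat, only the last two digits separate
;; all four words.
(best-start '("2013017" "2013042" "2014018" "2014093") 2)
;; => 5, 0.0d0
```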
Analysis
The result above indicates that I should type the 4th, 5th and 6th
digits of the students’ id numbers. However, there is a danger that
these digits collide with subsequences starting at other positions. For
example, typing “456” matches “1234567” but also matches “4567890”. To
prevent such collisions, I had better create a primary key for each
student using the 4th through 6th digits of the student number together
with, say, the first letter of their last name. This is a key I can
easily type into the search box of my spreadsheet software as I glance
at each student’s exam paper.
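The collision described here is easy to reproduce. The sketch below assumes the search box matches a key at any position in the id, which is modeled with the standard `search` function; `ids-matching` is a hypothetical helper, and the third id is made up to show a non-match.

```lisp
;; Hypothetical helper: ids in DATA that contain KEY anywhere as a substring,
;; mimicking a search box that matches at any position.
(defun ids-matching (key data)
  (remove-if-not (lambda (id) (search key id)) data))

(ids-matching "456" '("1234567" "4567890" "9876543"))
;; => ("1234567" "4567890")
```

Running such a check for every student's three-digit key would show how many keys are ambiguous before deciding whether the extra letter is needed.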