Source code entropy
Description of the problem
There is a very beautiful way of calculating the information content
of a channel. This measure, called Shannon
Entropy , measures the randomness of a signal. And the idea is that
the more random the signal is the more information it contains.
Today, I am going to apply this idea to randomly collected source
code from various languages.
Calculation
The entropy of a text which uses an alphabet of symbols is where is the probability of seeing -th letter in the text, assuming we
order our alphabet .
Thus if we had all of our letters appearing equally likely, the entropy
would have been In our case we
consider the letters 33 to 126 in the ASCII table. This is a total of 94
letters giving .
Implementation
For the purposes I set up, I need to calculate the entropy of a given
text. This could be a news article, or source code or a text of a web
page etc. Here is a C
program doing exactly that. It reads
an input text from standard input and calculates the entropy and the
number of characters minus the white space.
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
int main() {
unsigned long table[256], total = 0;
double entropy = 0;
char c;
short i;
for(i=0; i<256; i++)
table[i] = 0;
while((c = getchar())!=EOF)
table[c]++;
for(i=33;i<127;i++)
if(table[i]) {
total += table[i];
entropy -= log(table[i])*table[i];
}
entropy /= total;
entropy += log(total);
printf("Total Characters: %6d\n",total);
printf("Entropy: %5.6f\n",entropy);
exit(0);
}
Now, I need some randomly collected source code from various
languages. As a reference I am going to calculate the entropy for texts
from natural language. For that I used few Jack London books from Project Gutenberg ’s website. As for
the source code from different languages, it would be best to compare
the codes which essentially perform the same task to reduce the
variation. For that reason, I used AIMA ’s code repository
which provide code bases in languages java, C++, lisp, prolog and
python.
Here are the results for these languages:
c++
407737
3.476
java
1349032
3.638
lisp [case sensitive]
287749
3.595
lisp [not case sensitive]
287749
3.397
prolog
102322
3.679
python
152445
3.603
Jack London
1859084
3.127
Totally random [case sensitive]
4.543
Totally random [not case sensitive]
4.220
In the graph above, I normalized the numbers by subtracting the
Shannon Entropy value for the texts from Jack London.
Interpretation
The natural language text is the least random, thus has the minimal
raw information content. I found it surprizing that java and prolog were
close and generated the highest values. On the outset, java and prolog
are very very different languages. But yet in terms of raw information
content of the source code they are close. This interesting. On the
other hand python and lisp were in the same class in this comparison if
were to treat the lisp code as case sensitive. Once we treat lisp as
case insensitive the raw information content drops sharply. However, the
drop from case-sensitive to not-case-sensitive in lisp [which is roughly
6%] is similar to the drop in the fully random case [which is roughly
7%]. And in terms of source code raw information content c++ had the
smallest value among all case sensitive languages. The case insentive
lisp code has the smallest raw information content.
One big shortcoming of the
methodology.
I should have removed all comments from the sources to compare the
parts of the code that actually do the real work, but I did not.
Older Posts
[2025-02-23 ] Counting Matroids
[2025-02-12 ] Sampling from
a Random Variable
[2025-01-29 ] Markov Numbers
[2024-12-24 ] Number of isomorphism classes
of simple graph (continued)
[2024-12-22 ] Counting Isomorphism
Classes of Graphs
[2024-11-25 ] Connected Components
of Graphs
[2024-11-24 ] Counting connected
components of a graph
[2024-11-18 ] Counting Isomorphism
Classes of m -ary Trees
[2024-11-16 ] Number of
Isomorphism Classes of Ternary Trees
[2024-11-12 ] Hosoya Index of Balanced
Binary Trees
[2024-11-11 ] Hosoya Index of a
Graph
[2024-10-29 ] The Clique Number of a Simple
Graph
[2024-10-28 ] The Size of
Maximally Independent Subsets in a Graph
[2023-11-03 ] Graph
Algorithms in JGraphT with Common Lisp
[2023-10-28 ] An
Implementation of Pandas’ cut
and qcut
in
Lisp
[2023-07-24 ] A
Collatz-like Conjecture for the Projective Line
[2023-03-06 ] Twin
Primes, Cousin Primes, Sexy Primes, and Prime Triplets
[2023-03-02 ] Set
of All Partitions of a Finite Set
[2023-02-14 ] Non-crossing
Partitions and Dyck Words
[2023-02-13 ] Non-crossing
Linear Chords
[2023-02-04 ] Clojure/Python
Interop Examples
[2023-01-14 ] Graph
Algorithms in Clojure with JGraphT
[2022-03-29 ] 2D-Random Walk
[2022-03-28 ] Trade
Deficit vs Exchange Rate Curve
[2022-03-16 ] Working
with World Bank Data in Python
[2022-03-09 ] Working
with European Central Bank data in python (revisited)
[2022-01-24 ] A
Clique Analysis of Quakers in early modern Britain (1500-1700)
[2021-12-05 ] Boyer–Moore
and Misra-Gries Algorithms in Clojure
[2021-09-12 ] Tension in Text
Plotted
[2021-09-02 ] Statistical
Distributions using Apache Commons Math in Clojure
[2021-08-31 ] Reduce
with Intermediate Results in Common Lisp
[2021-08-21 ] Multivariate
Regression Implemented in Clojure
[2021-05-29 ] Using
Neural Networks to Detect Graph Properties
[2021-04-17 ] Fast
Null-Space Calculation via LU-Decomposition
[2021-02-24 ] Stoer-Wagner
Algorithm in Clojure
[2021-02-19 ] Calculating
Vertex Covers in Clojure
[2021-02-18 ] Listing All
Paths in a Graph
[2021-02-14 ] Strict
Dyck Words and Fibonacci Numbers
[2021-02-14 ] Kruskal’s
Algorithm in Common Lisp
[2021-02-13 ] Kruskal's
Algorithm Implemented in Clojure
[2021-02-10 ] An
integer dynamical system of integers
[2021-02-08 ] Binary
Symmetrization
[2021-01-28 ] Prüfer
Encoding and Decoding of a Tree in Clojure
[2021-01-27 ] Counting
Cycle-Free Paths in a Graph
[2021-01-27 ] Counting
Connected Labeled Graphs
[2020-12-18 ] Counting
Graphs with a Prescribed Degree Sequence
[2020-12-13 ] Havel–Hakimi
Algorithm in Clojure
[2020-12-12 ] Havel–Hakimi
Algorithm in Common-Lisp
[2020-10-23 ] The Quadratic
Casimir Element
[2020-07-04 ] Collatz Sequence
in Binary
[2020-07-02 ] A Lazy
Sequence of Primes in Clojure
[2020-06-10 ] Yet
Another Fizz-Buzz in Common Lisp
[2020-05-12 ] ECB
Data with Clojure and Vega-Lite
[2020-05-06 ] Processing
ECB Data with Common Lisp
[2020-04-17 ] Next
Permutation in the Lexicographical Ordering
[2020-04-13 ] Turkish
Hyphenation in Common Lisp
[2020-04-01 ] Using JavaPlex
with Clojure
[2019-11-05 ] Constricted
Arithmetic Progressions
[2019-11-03 ] The
Number of Arithmetic Progressions of Integers
[2019-05-06 ] Bron-Kerbosch
Algorithm in Clojure
[2019-05-01 ] An
Implementation of Ford-Fulkerson Algorithm in Clojure
[2019-04-22 ] Document
Summarization via Nonnegative Matrix Factorization
[2019-04-20 ] Latent
Semantic Analysis in Clojure
[2019-04-13 ] K-Nearest
Neighbors Algorithm in Clojure
[2019-04-06 ] K-Means
Implemented in Clojure
[2019-03-19 ] Prüfer
Encoding/Decoding of a Tree in Common Lisp
[2019-03-05 ] Gale-Shaply
Algorithm in Common Lisp
[2019-03-02 ] Calculating
The Correct Rank of a Matrix
[2018-12-04 ] Feed-forward
and back-propagation in neural networks as left- and right-fold
[2018-10-31 ] Nonnegative
Matrix Decomposition in Clojure
[2018-10-30 ] Non-negative
Matrix Decomposition in Scala
[2018-08-30 ] Working
with European Central Bank Data in Scala
[2018-07-30 ] Perverse
Sequences
[2018-05-28 ] Online
Perceptron in Common Lisp
[2018-05-26 ] Online Perceptron
[2018-05-18 ] Online Regression
[2018-05-06 ] Knut’s
Algorithm-S in Common Lisp
[2018-02-28 ] Irreducible Dyck
Words
[2018-02-19 ] Optimization
with GNU Scientific Library for Lisp
[2018-02-10 ] Van Eck’s
Sequence
[2018-02-09 ] Hiring
networks in mathematics
[2018-02-08 ] Linus Sequence
[2018-02-05 ] Egyptian
Fractions
[2018-02-01 ] Listing all
Young Tableaux
[2018-01-23 ] Collatz sequence
(yet again)
[2018-01-15 ] Hofstadter's Q
sequence
[2018-01-09 ] Farey Sequence
[2018-01-09 ] Catalan's
Triangle
[2018-01-06 ] The
Shoelace Formula for the Area of a Polygon
[2017-10-01 ] Working
with European Central Bank Data in Python
[2017-09-27 ] Expected
Value of the Diameter of a Tree
[2017-09-26 ] Using
Quandl with kixi.stats on Clojure
[2017-09-22 ] Using Quandl
with Common Lisp
[2017-08-05 ] Solving
Linear Equations in Natural Numbers
[2017-07-31 ] Transitive
Closure of a Directed Graph or a Relation
[2017-07-20 ] Steenrod-Milnor
and Tournament Sequences
[2017-07-15 ] A
lower bound on the radius of a graph
[2017-07-08 ] All partitions
of an integer
[2017-07-06 ] Some Hasse
Diagrams
[2017-07-04 ] Shuffles
[2017-07-03 ] Kaprekar Sequence
[2017-07-01 ] Lattice of Dyck
Words
[2017-06-28 ] The
poset of connected subgraphs of a connected graph
[2017-06-21 ] Calculating
the Diameter and the Radius of a Graph Using Tropic Linear
Algebra
[2017-06-19 ] Generating
random regular graphs
[2017-06-14 ] Estimating
the maximum element of a large list
[2017-06-09 ] A
Stochastic Gradient Descent Implementation in Clojure
[2017-06-06 ] A topology
problem
[2017-04-22 ] Listing duplicate
files
[2017-03-14 ] My First Idris
Proof
[2016-12-02 ] Distinguishing
hash functions (part II)
[2016-12-01 ] Distinguishing
hash functions
[2016-10-20 ] Hofstadter-Conway
$10,000 sequence
[2016-08-18 ] A
Solution for Problem 171 of 4Clojure
[2016-08-16 ] Puzzles and Group
Theory
[2016-08-13 ] Using Weka within
Lisp
[2016-07-12 ] Funniest
and Unfunniest Jokes in the Jester Dataset
[2016-07-05 ] Generating
Uniformly Random Connected Graphs
[2016-06-16 ] The
Robinson-Schensted Algorithm
[2016-06-01 ] Conjugate
Partitions
[2016-04-27 ] Using Word2Vec
from Clojure
[2016-04-24 ] Using
Word2Vec from Common Lisp
[2016-04-18 ] A Migration
Analysis
[2016-04-11 ] Basic
Data Analysis with CL without Frameworks
[2016-03-25 ] Parallel
map-reduce in Common Lisp
[2016-02-22 ] Text
Summarization and Topic Analysis
[2016-01-27 ] Set Covering
Problem
[2016-01-25 ] Kolmogorov-Smirnov
Test
[2016-01-20 ] Eigen-values
of the Laplacian and Connected Components of a Graph
[2015-12-12 ] Dual Graphs
[2015-10-26 ] Longest
Increasing Subsequence Revisited
[2015-10-16 ] Document
Summarization via Markov Chains
[2015-10-07 ] Computational
Literary Analysis
[2015-09-30 ] Library of
Babel in Common Lisp
[2015-09-28 ] Merging
Association Lists in Common Lisp
[2015-07-22 ] Cheapest
Paths via Tropic Matrices
[2015-07-21 ] Hidden
Markov Models via Tropic Matrices
[2015-07-08 ] A non-technical
post
[2015-06-28 ] An
implementation of the Viterbi algorithm in Common Lisp
[2015-05-28 ] Greatest
Common Divisor of Two Rational Numbers
[2015-05-21 ] Partitions
of Equal Measure Whatever the Measure May Be
[2015-05-14 ] Finding Cliques
in a Graph
[2015-05-12 ] Set Cover Problem
[2015-05-03 ] Threading
Macros in Common Lisp
[2015-05-03 ] Happy Numbers
[2015-05-01 ] Collatz Primes
[2015-04-23 ] Splitting Streams
[2015-04-06 ] Hamming
Distance and Double Hashing
[2015-04-05 ] Hamming
Distance and Hashing Functions
[2015-04-05 ] Hamming
Derivative of Hashing Functions
[2015-04-02 ] A Topology
Problem
[2015-03-21 ] Curve
Fitting is a Gram-Schmidt Reduction
[2015-03-08 ] Maximum
number of characters using keystrokes A, Ctrl+A, Ctrl+C and
Ctrl+V
[2015-03-06 ] Eccentricity,
Radius and Diameter in a Graph, Revisited
[2015-03-01 ] Graphs and
Entropy
[2015-02-22 ] Math PhD
Hiring Network (Part 3)
[2015-02-19 ] Math PhD
Hiring Network (Part 2)
[2015-02-18 ] Math PhD
Hiring Network (Part 1)
[2015-02-17 ] Faculty
Networks and Inequality in Hiring Practices in Universities
[2015-02-10 ] Functional
Streams in Lisp Explained
[2015-02-05 ] Collatz-type
Conjectures (Continued)
[2015-02-04 ] Collatz-type
Conjectures (Continued)
[2015-01-31 ] Collatz-type
Conjectures (Continued)
[2015-01-30 ] Collatz-type
Conjectures
[2015-01-28 ] Experiments
with Infinite Recursive Sequences (continued)
[2015-01-17 ] Experiments
with Infinite Recursive Sequences
[2015-01-10 ] Goldbach Pairs
[2015-01-02 ] Collatz Lengths
(Continued)
[2015-01-01 ] Functional
Streams
[2014-12-27 ] Polarization
in the US Congress
[2014-12-18 ] Partition a
sequence
[2014-11-28 ] Uniformly
Random Permutations
[2014-11-22 ] An
Implementation of Ford-Fulkerson Algorithm in Common Lisp
[2014-11-17 ] Tropic
Calculation of Cheapest Paths
[2014-11-05 ] Longest
common subsequence of two sequences
[2014-10-30 ] Counting
Spanning Trees of a Graph
[2014-10-26 ] Longest
Increasing Subsequence
[2014-10-24 ] The
Number of Inversions in a Sequence
[2014-10-22 ] Hashes and
Entropy
[2014-10-09 ] Estimating
Cardinality with Constant Memory Complexity
[2014-09-30 ] Landau's Function
[2014-09-29 ] A
Problem on Substitution Ciphers and Group Theory
[2014-09-28 ] A Morse Code
Translator
[2014-09-23 ] A
Memoization Macro for Common Lisp
[2014-09-21 ] Reducers are
Monoid Morphisms
[2014-09-18 ] Number
of isomorphism classes of binary trees
[2014-09-07 ] CONS is your
friend
[2014-08-22 ] A Zipf's Law
Simulation
[2014-08-07 ] Generating
Uniformly Random Trees
[2014-07-09 ] A Solution
for Project Euler #463
[2014-06-12 ] Entropy of
truncated MD5 hashing
[2014-06-08 ] Hexadecimal digits
of π
[2014-02-11 ] Information
content of n-grams
[2014-02-08 ] Turkish
Sentiment Analysis Using Thesaurus Distance
[2014-02-01 ] Sentiment
analysis using word distances
[2014-01-27 ] Phase
transitions in entropy
[2013-12-13 ] Optimal length of
n-grams
[2013-12-10 ] Counting
strings that contain intervals of same letter repetitions
[2013-12-02 ] Patterns
Separating Large Texts
[2013-11-23 ] Collatz
Sequences (Continued)
[2013-11-11 ] Entropy
and approximately one-to-one maps
[2013-10-23 ] Tree Isomorphism
[2013-10-15 ] Self Organizing
Maps
[2013-09-15 ] Euler Project
#401
[2013-09-15 ] An
additively recursive definition of the Moebius function
[2013-09-11 ] An
Unsuccessful Attempt for Solving Euler Project #401
[2013-09-04 ] Uniform
Sampling from Parametrized Submanifolds in Scala
[2013-09-04 ] Uniform
Sampling from Parametrized Submanifolds
[2013-08-30 ] Randomly
Generated Points Obeying a Distribution
[2013-08-25 ] Simulated
Annealing in Lisp
[2013-08-21 ] Eigenvalues
and Eigenvectors in GSLL
[2013-08-16 ] Reservoir
Sampling
[2013-08-11 ] Gibbs
sampling in lisp compared with C
[2013-08-10 ] Logistic
Regression in lisp
[2013-08-10 ] Linear
Discriminant Analysis in R
[2013-07-17 ] A
Gradient Descent Implementation in Lisp
[2013-07-01 ] k-Nearest
Neighbor Classification Algorithm Implemented in Lisp
[2013-05-19 ] Newton-Raphson
Method
[2013-05-07 ] Levenshtein
Distance
[2013-04-15 ] Cut points in a
graph
[2013-04-01 ] Experiments
on Collatz Lengths (Continued)
[2013-02-18 ] The
sound of the torsion parts of homotopy groups of spheres
[2013-02-12 ] Monadic Units
[2013-02-07 ] Distribution
of Collatz Lengths (continued)
[2013-02-03 ] Distribution
of Collatz Lengths
[2013-01-31 ] Quotients
of polynomial algebras
[2013-01-12 ] Path ideals
[2013-01-10 ] McCarthy91
Terminates
[2013-01-09 ] Finding
all paths in a directed graph
[2013-01-04 ] A
Simple Monte-Carlo Integration Implementation in Lisp
[2012-12-30 ] A
simple problem in Kolmogorov-Chaitin complexity
[2012-12-29 ] From walks to
paths
[2012-12-16 ] Higher
order functions, functors and monads
[2012-12-13 ] Eccentricity,
Radius and Diameter in an Undirected Graph
[2012-11-29 ] Untitled
[2012-11-25 ] Strictly
Increasing Labels of Directed Graphs
[2012-11-19 ] Strictly
Increasing Labellings of Directed Graphs
[2012-11-17 ] Nilpotent
elements in an artinian algebra
[2012-11-04 ] Local
rings, idempotents and non-invertible elements
[2012-10-18 ] An
implementation of the fixed-radius near neighbor clustering algorithm in
lisp
[2012-10-15 ] Reducing directed
graphs
[2012-10-10 ] An
implementation of the k -means
clustering algorithm in lisp
[2012-10-08 ] A
comparison of different map functions in lisp
[2012-10-03 ] Source code
entropy
[2012-09-28 ] Collisions in
random walks
[2012-09-26 ] Transitive
closure of a directed graph
[2012-09-26 ] Solving
linear equations in ℕ
[2012-09-26 ] Listing
partitions
[2012-09-26 ] Inverting
formal power series
[2012-09-26 ] Hasse
subgraph of a directed graph