The Kitchen Sink and Other Oddities

Atabey Kaygun

Counting strings that contain intervals of same letter repetitions

Description of the problem

Today, I will look at a simple combinatorics problem that came up in my discrete math class.

Assume we have an alphabet \(a_1,\ldots,a_r\) of size \(r\). How many different strings of length \(n\) can I construct in which the intervals of same letter repetitions have length at most \(k\)?

For example if \(n=5\) and \(k=3\) the strings “ababa” “aabaa” “aabba” are ok but “aaaab” is not because the upper limit for lengths of intervals of same letter repetitions was \(3\).

A solution

First, I need to determine the shape of the string. For that I need to count the number of solutions for the equation \[ x_1 + \cdots + x_m = n \quad\text{ with }\quad 0< x_i\leq k \] The number of solutions of this equation counts the number of strings of the shape \[ \underbrace{a_1\cdots a_1}_{\text{$x_1$-times}}\cdots \underbrace{a_m\cdots a_m}_{\text{$x_m$-times}} \] where \(a_i\neq a_{i+1}\). If \(N(n,m,k)\) is the number of solutions for the equation above, then the number of strings of this shape is \[ N(n,m,k)\cdot r(r-1)^{m-1} \] Then the number of strings I want is \[ \sum_{m=1}^n N(n,m,k)\cdot r(r-1)^{m-1} \]

I will describe the function \(N(n,m,k)\) recursively. We have a simple recurrence relation of the form \[ N(n,m,k) = \sum_{i=0}^m \binom{m}{i} N(n-m,m-i,k-1) \] with the base cases \[ N(n,m,k) = \begin{cases} 0 & \text{ if } km < n\\ 1 & \text{ if $m=n$ or $km=n$}\\ 0 & \text{ if $m=1$ and $k < n$}\\ 1 & \text{ if $m=1$ and $k\geq n$} \end{cases} \] The recursive equation comes from the following argument: in the equation \[ x_1 + \cdots + x_m = n \quad\text{ with }\quad 1\leq x_i\leq k \] if we reduce every term by one, the upper bound \(k\) is also reduced by one and we get \[ y_1 + \cdots + y_m = n-m \quad\text{ with }\quad 0\leq y_i\leq k-1 \] Now, we have to control the terms which are equal to \(0\) and we go over all possibilities in which \(y_j\)’s are 0 in the sum.

There is also a functional relation of the form \[ N(n,m,k) = N(m(k+1)-n,m,k) \] because the set of solutions to the equation \[ x_1 + \cdots + x_m = n \] with the constraint \(1\leq x_i\leq k\) is bijective to the set of solutions of the equation \[ z_1 + \cdots + z_m = mk - n \] with the constraint \(0\leq z_i\leq k-1\) since every \(x_i\) in the first equation has a unique conjugate \(z_i = k-x_i\) in the second equation. By increasing the elements \(z_i\) by 1 we get \[ y_1 + \cdots + y_m = m(k+1) - n\] with the constraint \(1\leq y_i\leq k\).

An implementation

I will start by implementing a function for the binomial numbers.

(defun binom (n k)
   (cond ((< (- n k) k) (binom n (- n k)))
         ((= k 0) 1)
         ((= k 1) n)
         (t (reduce '* (loop for i from 0 to (1- k)
                             collect (/ (- n i) 
                                        (- k i)))))))

BINOM

Here is a simple implementation for the number of solutions for the integer-linear equation I was considering earlier with built-in memoization:

(let ((lookup (make-hash-table :test 'equal)))
   (defun solutions (n m k)
      (cond ((or (<= k 0) (<= m 0) (<= n 0) (> n (* m k))) 0)
            ((= m 1) (if (< k n) 0 1))
            ((= m n) 1)
            ((= n (* m k)) 1)
            ((> n (- (* m (1+ k)) n)) (solutions (- (* m (1+ k)) n) m k))
            (t (or (gethash (list n m k) lookup)
                   (setf (gethash (list n m k) lookup)
                         (loop for i from 0 to (1- m)
                               sum (* (binom m i) 
                                      (solutions (- n m) (- m i) (1- k))))))))))

SOLUTIONS

And finally the number of words

(defun count-solutions (n k r)
   (loop for m from (truncate (/ n k)) to n
         sum (* r
                (solutions n m k) 
                (expt (1- r) (1- m)))))

COUNT-SOLUTIONS

The following is a test function.

(defun test-solutions (a b)
   (loop for i from 1 to a collect (count-solutions a i b)))

TEST-SOLUTIONS

For \(n=3\) the list of bit strings of length 3 are

101 010
001 110 100 011
000 111

grouped together according to the largest repeating intervals. For \(n=4\) the list is

1010 0101
0010 1101 0100 1011 0011 1100 0110 1001
0001 1110 1000 0111
0000 1111

And finally for \(n=5\)

10101 01010
00101 11010 10100 01011 00110 11001 01100 10011 10010 01101 01001 10110 00100 11011 
00010 11101 01000 10111 10001 01110 00111 11000 11100 00011
00001 11110 10000 01111
00000 11111

The numbers below confirm these results.

(test-solutions 3 2)

(2 6 8)

(test-solutions 4 2)

(2 10 14 16)

(test-solutions 5 2)

(2 16 26 30 32)

The following is the number of strings of length 50 using only the letters ‘a’, 'c’, 'g’ and ‘t’ that contain same letter intervals of length at most 5.

(count-solutions 50 5 4)

1226096552838244254073677630864

This is approximately 96.72 percent of all words of length 50 using the alphabet “a,c,g,t”,