The Kitchen Sink and Other Oddities

Atabey Kaygun

The Depth Limit: Why Regular Expressions Cannot Parse Arbitrary Trees

It is a well-known adage in computer science that one should not use regular expressions to parse HTML or XML. This rule stems from a fundamental limitation in formal language theory: classical regular expressions are mathematically incapable of parsing nested tree structures of arbitrary depth. Whether one is extracting metadata from structured archives or tokenizing source code, tree structures demand the ability to match hierarchical boundaries, such as opening and closing parentheses or nested tags. Pure regular expressions possess only a finite amount of internal memory: a machine with finitely many states can distinguish only finitely many nesting depths. Consequently, they can track nesting only up to a fixed, pre-determined depth before exhausting their state space, making the traversal of arbitrarily deep trees impossible.

Regular Expressions in Practice and the “Infinite Depth” of PCRE

In practical text processing, regular expressions are the engine behind ubiquitous Unix utilities like grep, sed, and awk. These tools excel at pattern matching over flat sequences. For instance, using awk to extract categorized records might look like

/^[A-Z][a-z]+:/ { print $1 }

or using sed to reduce dates of the form YYYY-MM to just their year component

sed -E 's/([0-9]{4})-([0-9]{2})/\1/g'

These extended regular expressions (ERE) add syntactic convenience—such as the + quantifier or grouping parentheses—but they do not fundamentally alter the underlying computational power; they remain strictly bound to finite memory.
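This bound can be made concrete with a short sketch in Python, whose standard re module, like ERE, lacks recursion: any pattern for nested parentheses must hard-code a maximum depth. The pattern below is a hypothetical hand-unrolled depth-two pattern, not one drawn from any particular tool.

```python
import re

# Sketch: without recursion, nesting depth must be baked into the pattern.
# depth 1: \([^()]*\)
# depth 2: \((?:[^()]|\([^()]*\))*\)  (plain chars or one depth-1 group)
depth2 = re.compile(r"^\((?:[^()]|\([^()]*\))*\)$")

print(bool(depth2.match("(a(b)c)")))    # depth 2: matches
print(bool(depth2.match("(a(b(c)))")))  # depth 3: the fixed pattern fails
```

A depth-three input escapes the pattern not because of any engine limit, but because the pattern simply has no third level written into it.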

However, the landscape shifts dramatically with the introduction of Perl-Compatible Regular Expressions (PCRE). Modern PCRE engines include features that mathematically disqualify them from being “regular.” The most critical of these is recursion. A pattern such as

(\((?:[^()]+|(?1))*\))

utilizes (?1) to recursively invoke the first capture group. This mechanism implicitly relies on a call stack. By introducing a pushdown memory structure, PCRE implementations transcend the boundaries of finite state machines. The stack allows the engine to suspend its current state, descend into a nested branch of a tree, and resume upon returning, effectively granting it the “infinite depth” necessary to parse context-free structures such as balanced parentheses (and, with backreferences, even some context-sensitive ones).
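Python's standard re module does not support (?1), but the pushdown idea behind it can be sketched directly: a small hand-written parser in which the host language's call stack plays the role of PCRE's recursion stack, accepting balanced groups of any depth. The function names here are illustrative, not from any library.

```python
def parse_parens(s: str, i: int = 0) -> int:
    """Parse one balanced (...) group starting at s[i]; return the index
    just past its closing parenthesis, or raise ValueError."""
    if i >= len(s) or s[i] != "(":
        raise ValueError("expected '('")
    i += 1
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            i = parse_parens(s, i)  # descend: the call stack is the pushdown
        else:
            i += 1                  # plain character inside the group
    if i >= len(s):
        raise ValueError("unbalanced")
    return i + 1                    # step past ')'

def is_balanced_group(s: str) -> bool:
    try:
        return parse_parens(s) == len(s)
    except ValueError:
        return False

print(is_balanced_group("(a(b(c))d)"))  # True, at any nesting depth
print(is_balanced_group("(()"))         # False
```

Each recursive call suspends the current position, handles the nested group, and resumes where it left off, exactly the suspend-descend-resume behavior described above.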

The Formal Hierarchy of Regular Languages

To understand this limitation formally, we must look at how regular languages are constructed. Let \(\Sigma\) be a finite alphabet. A regular language over \(\Sigma\) is defined recursively: the empty set, the language containing only the empty string, and each single-symbol language \(\{a\}\) for \(a \in \Sigma\) are regular, as is anything built from regular languages by three operations: union (\(A \cup B\)), concatenation (\(AB\)), and the Kleene star (\(A^*\)). The fundamental result connecting these algebraic expressions to computation is Kleene’s Theorem, which states that a language is generated by a regular expression if and only if it is recognized by a Finite-State Automaton (FSA).

The inability of an FSA to parse arbitrary trees is neatly captured by the Pumping Lemma for regular languages. Consider the simplest abstraction of a tree: a set of perfectly balanced parentheses, represented as the language \(L = \{ (^n )^n \mid n \ge 0 \}\). The Pumping Lemma asserts that for any regular language there is a pumping length \(p\) such that any string in the language of length at least \(p\) can be divided into three parts, \(s = xyz\), with \(|xy| \le p\) and \(y\) nonempty, where the middle portion \(y\) can be repeated any number of times (\(xy^iz\) for every \(i \ge 0\)) while the result remains in the language. If an FSA attempts to parse a tree whose depth exceeds \(p\), the automaton is inevitably forced into a loop while reading the opening brackets; since \(|xy| \le p\), the loop \(y\) consists entirely of opening brackets. Pumping this loop produces strings with mismatched boundaries (e.g., \((^{n+k} )^n\)), proving that the set of all trees cannot be a regular language.
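The argument can be illustrated (though of course not proved) in a few lines of Python: pick a hypothetical pumping length \(p\), take the decomposition the lemma would force on \((^p )^p\), and observe that pumping \(y\) immediately leaves the language.

```python
# Illustration of the pumping argument for L = { "("*n + ")"*n }.
# With |xy| <= p, y must lie entirely inside the opening brackets,
# so repeating it unbalances the string. p = 5 is a hypothetical value.

def in_L(s: str) -> bool:
    n = len(s) // 2
    return len(s) == 2 * n and s == "(" * n + ")" * n

p = 5
s = "(" * p + ")" * p       # a string in L of length >= p
x, y, z = "", "(", s[1:]    # a legal split: |xy| <= p, y nonempty

print(in_L(x + y + z))      # i = 1 recovers s itself: True
print(in_L(x + y * 2 + z))  # i = 2: one '(' too many -> False
print(in_L(x + z))          # i = 0: one '(' too few  -> False
```

No matter which split within the first \(p\) characters one chooses, the pumped strings fall out of \(L\), which is exactly the contradiction the lemma delivers.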

Finite Automata and Finite Trees

While arbitrary depth is impossible, regular expressions can perfectly parse trees if the maximum depth is bounded by a finite integer \(N\). A tree of depth at most \(N\) over a finite set of tags constitutes a strictly finite language. Because every finite language is inherently regular, we can construct a Deterministic Finite Automaton (DFA) to recognize it. By the Myhill-Nerode theorem, the required memory is determined by the number of equivalence classes of the language’s indistinguishability relation, under which two prefixes are equivalent exactly when no suffix separates them. To parse up to depth \(N\), the automaton needs a state space that grows at least linearly with \(N\) just to “count” the current depth; if \(k\) distinct tags must be matched, it must remember the entire stack of open tags, and the state count grows on the order of \(k^N\). In effect, the recursion is unrolled into a finite, acyclic collection of states.
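This unrolling can be made explicit. The sketch below builds, for a hypothetical bound N, a recursion-free pattern for parenthesized groups of depth at most N; the pattern accepts depth N, rejects depth N+1, and visibly grows with N (the pattern's length standing in for the DFA's state count).

```python
import re

# Sketch: unroll the recursion into a finite pattern. depth(n) describes
# arbitrary text interleaved with parenthesized groups nested at most n deep.

def depth(n: int) -> str:
    if n == 0:
        return r"[^()]*"                     # no brackets left at this level
    inner = depth(n - 1)
    return rf"[^()]*(?:\({inner}\)[^()]*)*"  # text and depth-(n-1) groups

def matches(pattern: str, s: str) -> bool:
    return re.fullmatch(pattern, s) is not None

n = 3
print(matches(depth(n), "(" * n + ")" * n))           # True: within the bound
print(matches(depth(n), "(" * (n + 1) + ")" * (n + 1)))  # False: too deep
print(len(depth(1)), len(depth(3)))  # the pattern itself grows with n
```

Every additional level of nesting requires literally writing out another copy of the inner pattern, which is the textual counterpart of the automaton allocating more states to count depth.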

The Mathematical Trade-off: Power Versus Decidability

Ultimately, the inability of regular expressions to parse arbitrary trees should be viewed not as a flaw, but as an elegant trade-off inherent in formal language theory: we sacrifice expressive power to gain absolute computational certainty. Because classical regular expressions compile into finite automata, they guarantee linear-time execution and guaranteed termination. While finite automata cannot handle unbounded recursion, their strict mathematical boundaries make them the safest choice for high-speed lexical analysis. In this light, the depth limit is a necessary structural constraint that keeps our foundational search tools blindingly fast and mathematically decidable, ensuring that the machine will always yield an answer without falling into the abyss of an infinite loop.