Abstract

Regular expressions are a fundamental notion in formal language theory and are frequently used in computer science to define search patterns. In particular, regular expression matching is a widely used computational primitive, employed in many programming languages and text processing utilities. A classic algorithm for regular expression matching constructs and simulates a non-deterministic finite automaton (NFA) corresponding to the expression, resulting in an $O(m n)$ running time (where $m$ is the length of the pattern and $n$ is the length of the text). This running time can be improved slightly (by a logarithmic factor), but no significantly faster solutions are known. At the same time, much faster algorithms exist for various special cases of regular expressions, including dictionary matching, wildcard matching, and subset matching.
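
To make the classic algorithm concrete, here is a minimal Python sketch of the set-of-states NFA simulation (our own illustration; the example pattern $(ab|c)^+$, the hand-built NFA, and the function names are ours, not taken from any particular implementation):

```python
# Minimal sketch of the classic set-of-states NFA simulation.
# The NFA below is hand-built in Thompson style for the example
# pattern (ab|c)+ ; labelled edges live in `delta`, epsilon-edges in `eps`.

def eps_closure(states, eps):
    """All states reachable from `states` via epsilon-edges."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def nfa_accepts(text, delta, eps, start, accept):
    """True iff the NFA accepts the whole `text`.

    Each Thompson-style state has O(1) outgoing edges, so processing
    one character costs O(m) for an m-state NFA, and O(m n) in total.
    """
    current = eps_closure({start}, eps)
    for ch in text:
        moved = {t for s in current for t in delta.get((s, ch), ())}
        current = eps_closure(moved, eps)
    return accept in current

# NFA for (ab|c)+ :  0 -a-> 1 -b-> 3,  0 -c-> 2,  eps: 2 -> 3,  3 -> 0.
delta = {(0, "a"): [1], (1, "b"): [3], (0, "c"): [2]}
eps = {2: [3], 3: [0]}  # the 3 -> 0 epsilon-edge realizes the Kleene plus

print(nfa_accepts("abcab", delta, eps, start=0, accept=3))  # True
print(nfa_accepts("abb",   delta, eps, start=0, accept=3))  # False
```

The sketch tests whole-string acceptance; to search for a match beginning anywhere in the text, one re-seeds the start state at every position, which leaves the $O(m n)$ bound unchanged.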

In this paper we show that the complexity of regular expression matching can be characterized based on the *depth* of the expression (when interpreted as a formula; see the examples below). Very roughly, our results state that for expressions involving concatenation, OR, and "Kleene plus", the following dichotomy holds:

(a) Matching regular expressions of depth two (involving any combination of the above operators) can be solved in near-linear time. In particular, this case covers the aforementioned variants of regular expression matching that are amenable to fast algorithms, and yields new ones.

(b) Matching regular expressions of depth three (involving any combination of the above operators) that are not reducible to depth-two expressions cannot be solved in sub-quadratic time unless the Strong Exponential Time Hypothesis (SETH) is false.
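
To illustrate the depth measure (the expressions below are our own examples, not the paper's classification), consider a dictionary pattern and its Kleene-plus closure, where each $s_i$ is a plain string, i.e., a concatenation of symbols:

$$\underbrace{s_1 \mid s_2 \mid \cdots \mid s_k}_{\text{depth two: OR of concatenations}} \qquad\qquad \underbrace{(s_1 \mid s_2 \mid \cdots \mid s_k)^{+}}_{\text{depth three: plus of an OR of concatenations}}$$

The left pattern applies an OR to concatenations, giving two operator levels; wrapping it in a Kleene plus adds one more level, giving depth three. Whether a given depth-three expression falls under (b) depends on it not being reducible to a depth-two one.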
