Score: 0

Compressed Dictionary Matching on Run-Length Encoded Strings

Published: September 3, 2025 | arXiv ID: 2509.03265v1

By: Philip Bille , Inge Li Gørtz , Simon J. Puglisi and more

Potential Business Impact:

Finds words in zipped text faster.

Business Areas:
Text Analytics Data and Analytics, Software

Given a set of pattern strings $\mathcal{P}=\{P_1, P_2,\ldots P_k\}$ and a text string $S$, the classic dictionary matching problem is to report all occurrences of each pattern in $S$. We study the dictionary problem in the compressed setting, where the pattern strings and the text string are compressed using run-length encoding, and the goal is to solve the problem without decompression and achieve efficient time and space in the size of the compressed strings. Let $m$ and $n$ be the total length of the patterns $\mathcal{P}$ and the length of the text string $S$, respectively, and let $\overline{m}$ and $\overline{n}$ be the total number of runs in the run-length encoding of the patterns in $\mathcal{P}$ and $S$, respectively. Our main result is an algorithm that achieves $O( (\overline{m} + \overline{n})\log \log m + \mathrm{occ})$ expected time, and $O(\overline{m})$ space, where $\mathrm{occ}$ is the total number of occurrences of patterns in $S$. This is the first non-trivial solution to the problem. Since any solution must read the input, our time bound is optimal within an $\log \log m$ factor. We introduce several new techniques to achieve our bounds, including a new compressed representation of the classic Aho-Corasick automaton and a new efficient string index that supports fast queries in run-length encoded strings.

Page Count
16 pages

Category
Computer Science:
Data Structures and Algorithms