Score: 0

Decomposing Words for Enhanced Compression: Exploring the Number of Runs in the Extended Burrows-Wheeler Transform

Published: June 5, 2025 | arXiv ID: 2506.04926v1

By: Florian Ingels, Anaïs Denis, Bastien Cazaux

Potential Business Impact:

Finds better ways to shrink computer files.

Business Areas:
Big Data Data and Analytics

The Burrows-Wheeler Transform (BWT) is a fundamental component in many data structures for text indexing and compression, widely used in areas such as bioinformatics and information retrieval. The extended BWT (eBWT) generalizes the classical BWT to multisets of strings, providing a flexible framework that captures many BWT-like constructions. Several known variants of the BWT can be viewed as instances of the eBWT applied to specific decompositions of a word. A central property of the BWT, essential for its compressibility, is the number of maximal ranges of equal letters, named runs. In this article, we explore how different decompositions of a word impact the number of runs in the resulting eBWT. First, we show that the number of decompositions of a word is exponential, even under minimal constraints on the size of the subsets in the decomposition. Second, we present an infinite family of words for which the ratio of the number of runs between the worst and best decompositions is unbounded, under the same minimal constraints. These results illustrate the potential cost of decomposition choices in eBWT-based compression and underline the challenges in optimizing run-length encoding in generalized BWT frameworks.

Page Count
15 pages

Category
Computer Science:
Data Structures and Algorithms