Score: 1

Fast and memory-efficient BWT construction of repetitive texts using Lyndon grammars

Published: April 27, 2025 | arXiv ID: 2504.19123v2

By: Jannik Olbrich

Potential Business Impact:

Finds patterns in huge data faster.

Business Areas:

Text Analytics Data and Analytics, Software

The Burrows-Wheeler Transform (BWT) serves as the basis for many important sequence indexes. On very large datasets (e.g. genomic databases), classical BWT construction algorithms are often infeasible because they usually need to have the entire dataset in main memory. Fortunately, such large datasets are often highly repetitive. It can thus be beneficial to compute the BWT from a compressed representation. We propose an algorithm for computing the BWT via the Lyndon straight-line program, a grammar based on the standard factorization of Lyndon words. Our algorithm can also be used to compute the extended BWT (eBWT) of a multiset of sequences. We empirically evaluate our implementation and find that we can compute the BWT and eBWT of very large datasets faster and/or with less memory than competing methods.