Fast and memory-efficient BWT construction of repetitive texts using Lyndon grammars
By: Jannik Olbrich
Potential Business Impact:
Finds patterns in huge data faster.
The Burrows-Wheeler Transform (BWT) serves as the basis for many important sequence indexes. On very large datasets (e.g. genomic databases), classical BWT construction algorithms are often infeasible because they usually need to have the entire dataset in main memory. Fortunately, such large datasets are often highly repetitive. It can thus be beneficial to compute the BWT from a compressed representation. We propose an algorithm for computing the BWT via the Lyndon straight-line program, a grammar based on the standard factorization of Lyndon words. Our algorithm can also be used to compute the extended BWT (eBWT) of a multiset of sequences. We empirically evaluate our implementation and find that we can compute the BWT and eBWT of very large datasets faster and/or with less memory than competing methods.
Similar Papers
Decomposing Words for Enhanced Compression: Exploring the Number of Runs in the Extended Burrows-Wheeler Transform
Data Structures and Algorithms
Finds better ways to shrink computer files.
Merging RLBWTs adaptively
Data Structures and Algorithms
Merges compressed text data much faster.
Inverting Parameterized Burrows-Wheeler Transform
Data Structures and Algorithms
Lets computers find patterns with missing letters.