DNACHUNKER: Learnable Tokenization for DNA Language Models
By: Taewon Kim, Jihwan Shin, Hyomin Kim, and more
Potential Business Impact:
Helps models read DNA more reliably, even when sequences shift or mutate.
DNA language models have emerged as powerful tools for decoding the complex language of DNA sequences. However, the performance of these models is heavily affected by their tokenization strategy, i.e., the method used to segment DNA sequences into shorter sequences of chunks. In this work, we propose DNACHUNKER, which integrates a learnable dynamic DNA tokenization mechanism and is trained as a masked language model. Adopting the dynamic chunking procedure proposed by H-Net, our model learns to segment sequences into variable-length chunks. This dynamic chunking offers two key advantages: it is robust to shifts and mutations in the DNA, and it allocates finer-grained representation to important functional regions. We demonstrate the performance of DNACHUNKER by training it on the human reference genome (HG38) and evaluating it on the Nucleotide Transformer and Genomic benchmarks. Further ablative experiments reveal that DNACHUNKER learns a tokenization that grasps biological grammar: it uses smaller chunks to preserve detail in important functional elements such as promoters and exons, while using larger chunks for repetitive, redundant regions.
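The abstract does not give implementation details, but the H-Net-style dynamic chunking it adopts can be sketched as follows: place a chunk boundary wherever adjacent per-nucleotide representations are dissimilar, then pool each variable-length chunk into a single token. This is a minimal illustrative sketch, not the paper's actual model; the threshold, the cosine-based boundary score, and mean-pooling are simplifying assumptions (H-Net uses learned projections and a differentiable routing mechanism).

```python
import numpy as np

def dynamic_chunk(embeddings, threshold=0.5):
    """Segment per-position embeddings into variable-length chunks.

    A boundary is placed where adjacent embeddings are dissimilar
    (low cosine similarity), loosely following the H-Net routing idea.
    """
    # Cosine similarity between each position and its predecessor.
    a, b = embeddings[1:], embeddings[:-1]
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    )
    # Boundary score: high when neighbours differ.
    p = (1.0 - cos) / 2.0
    # Position 0 always starts a chunk.
    boundaries = np.concatenate([[True], p > threshold])
    # Mean-pool each chunk into one token representation.
    starts = np.flatnonzero(boundaries)
    ends = np.append(starts[1:], len(embeddings))
    chunks = [embeddings[s:e].mean(axis=0) for s, e in zip(starts, ends)]
    return boundaries, np.stack(chunks)

# Toy example: two homogeneous regions with opposite embeddings.
emb = np.concatenate([np.tile([1.0, 0.0, 0.0, 0.0], (4, 1)),
                      np.tile([-1.0, 0.0, 0.0, 0.0], (4, 1))])
boundaries, chunks = dynamic_chunk(emb)
print(chunks.shape[0])  # → 2: identical positions merge, the region change splits
```

Under this scheme a repetitive region collapses into one large chunk, while a heterogeneous region (e.g. a promoter) yields many small chunks, matching the behaviour the paper reports.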
Similar Papers
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
Machine Learning (CS)
Computers learn language better from raw text.
H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages
Computation and Language
Teaches computers to understand languages better.
DNAZEN: Enhanced Gene Sequence Representations via Mixed Granularities of Coding Units
Machine Learning (CS)
Helps computers understand DNA better.