Evaluation of Coding Schemes for Transformer-based Gene Sequence Modeling
By: Chenlei Gong, Yuanhe Tian, Lei Mao, and more
Potential Business Impact:
Makes computers understand DNA better for science.
Many recent studies treat DNA sequences as a special kind of language and model them with Transformers. These studies rely on fixed-length k-mer segmentation or BPE subword tokenization, but a systematic evaluation of which works better has been lacking. We compare k-mer segmentation with k = 1, 3, 4, 5, and 6, a 4,096-token BPE vocabulary, and three positional encoding methods: sinusoidal, ALiBi, and RoPE. Each configuration is trained from scratch in 3-, 6-, 12-, and 24-layer Transformer encoders and evaluated on the GUE benchmark. Overall, BPE delivers higher and more stable performance across tasks by compressing frequent motifs into variable-length tokens, shortening sequences, and improving generalization. RoPE excels at capturing periodic motifs and extrapolating to long sequences, while ALiBi also performs well on tasks driven by local dependencies. In terms of depth, we observe significant gains when increasing from 3 to 12 layers, with only marginal improvements or slight overfitting at 24 layers. This study provides practical guidance for designing tokenization and positional encoding schemes in DNA Transformer models.
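The two tokenization schemes being compared can be illustrated with a short sketch. The snippet below is a minimal, hypothetical example, not the authors' code: `kmer_tokenize` performs fixed-length, non-overlapping k-mer segmentation (overlapping stride-1 variants are also common in DNA language models), and `bpe_tokenize` applies a toy ordered merge table standing in for a learned 4,096-token BPE vocabulary.

```python
# Minimal sketch of the two tokenization schemes compared in the paper.
# The merge table is a toy stand-in for a learned 4,096-token BPE vocabulary;
# function names and parameters are illustrative, not the authors' code.

def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Fixed-length, non-overlapping k-mer segmentation (trailing remainder dropped)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def bpe_tokenize(seq: str, merges: list[tuple[str, str]]) -> list[str]:
    """Greedy BPE: repeatedly apply the earliest-learned merge among adjacent pairs."""
    tokens = list(seq)  # start from single nucleotides A/C/G/T
    rank = {pair: i for i, pair in enumerate(merges)}
    while True:
        pairs = [(rank[(a, b)], i)
                 for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
                 if (a, b) in rank]
        if not pairs:
            return tokens
        _, i = min(pairs)  # best-ranked (then leftmost) pair wins
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]

seq = "TATAATGCGC"
print(kmer_tokenize(seq, k=3))                                   # ['TAT', 'AAT', 'GCG']
print(bpe_tokenize(seq, merges=[("T", "A"), ("TA", "TA"), ("G", "C")]))
# ['TATA', 'A', 'T', 'GC', 'GC']
```

The example also shows why BPE shortens sequences: frequent motifs such as the TATA box collapse into single variable-length tokens, while k-mer segmentation always emits one token per k bases.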
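The three positional encoding schemes differ in where position enters the computation: sinusoidal encoding adds fixed position vectors to the token embeddings, while the two relative schemes act inside attention. The NumPy sketch below is an assumption-laden illustration, not the paper's implementation: RoPE rotates query/key feature pairs by position-dependent angles before the dot product, and ALiBi adds a distance-proportional linear bias to the attention scores (shown here in its symmetric, bidirectional form with one hypothetical head slope).

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (seq_len, head_dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)               # per-pair rotation frequencies
    angles = np.arange(seq_len)[:, None] * freqs[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def alibi_bias(seq_len: int, slope: float = 0.0625) -> np.ndarray:
    """ALiBi for one head: linear penalty growing with query-key distance."""
    pos = np.arange(seq_len)
    return -slope * np.abs(pos[:, None] - pos[None, :])

# Toy attention scores for one head over 8 DNA tokens.
q = np.random.randn(8, 16)
k = np.random.randn(8, 16)
scores_rope = rope(q) @ rope(k).T / np.sqrt(16)       # position enters via rotation
scores_alibi = q @ k.T / np.sqrt(16) + alibi_bias(8)  # position enters via additive bias
```

Because both RoPE and ALiBi encode only relative offsets, they can be applied to sequences longer than those seen in training, which is consistent with the extrapolation behavior the abstract reports.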
Similar Papers
Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods
Computation and Language
Helps computers understand DNA better.
When repeats drive the vocabulary: a Byte-Pair Encoding analysis of T2T primate genomes
Genomics
Helps computers understand animal DNA better.
SeqPE: Transformer with Sequential Position Encoding
Machine Learning (CS)
Helps AI understand longer texts and images.