Peek2: A Regex-free implementation of pretokenizers for Byte-level BPE
By: Liu Zai
Potential Business Impact:
Speeds up the text-splitting step that AI language models run before tokenizing, by about 11% end to end.
Pretokenization is a crucial, sequential pass in Byte-level BPE tokenizers. Peek2, our new implementation, is a drop-in replacement for the cl100k-like pretokenizers used in GPT-3, LLaMa-3, and Qwen-2.5. Designed with performance and safety in mind, Peek2 is Regex-free and improves the overall throughput of the entire Byte-level BPE encoding process by $1.11\times$. The algorithm runs entirely on the CPU, has stable linear complexity $O(n)$, and produces presegmentation results identical to those of the original Regex-based pretokenizer.
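To make the idea of a Regex-free pretokenizer concrete, here is a minimal Python sketch of a single-pass scanner that looks at most one character ahead before choosing a rule. It is not the Peek2 implementation, which this abstract does not describe in detail. The rule set is an assumption based on the commonly published cl100k-style split pattern (contractions, letter runs with an optional one-character prefix, digit groups of up to three, punctuation runs, and whitespace handling). The names `pretokenize` and `next_boundary` are hypothetical, and Python's `str.isspace` is used as an approximation of the regex class `\s`.

```python
import unicodedata

# Assumed contraction suffixes from the cl100k-style pattern (matched case-insensitively).
CONTRACTIONS = ("'s", "'t", "'re", "'ve", "'m", "'ll", "'d")

def is_letter(c):
    return unicodedata.category(c).startswith("L")

def is_number(c):
    return unicodedata.category(c).startswith("N")

def is_other(c):
    # Neither letter, number, nor whitespace (punctuation, symbols, marks, ...).
    return not (is_letter(c) or is_number(c) or c.isspace())

def next_boundary(text, i):
    """Return the end index of the piece starting at i (hypothetical helper)."""
    n = len(text)
    c = text[i]
    nxt = text[i + 1] if i + 1 < n else ""
    # Rule 1: contractions such as 's or 'LL.
    if c == "'":
        low = text[i:i + 3].lower()
        for s in CONTRACTIONS:
            if low.startswith(s):
                return i + len(s)
    # Rule 2: a letter run, optionally preceded by one non-letter,
    # non-number, non-newline character (peek one character ahead).
    if is_letter(c) or (c not in "\r\n" and not is_number(c)
                        and nxt and is_letter(nxt)):
        j = i + 1
        while j < n and is_letter(text[j]):
            j += 1
        return j
    # Rule 3: at most three numeric characters per piece.
    if is_number(c):
        j = i + 1
        while j < n and j < i + 3 and is_number(text[j]):
            j += 1
        return j
    # Rule 4: a punctuation/symbol run, optionally preceded by one space,
    # followed by any trailing newlines.
    if is_other(c) or (c == " " and nxt and is_other(nxt)):
        j = i + 1
        while j < n and is_other(text[j]):
            j += 1
        while j < n and text[j] in "\r\n":
            j += 1
        return j
    # Remaining case: c starts a whitespace run.
    j = i
    while j < n and text[j].isspace():
        j += 1
    k = j - 1
    while k >= i and text[k] not in "\r\n":
        k -= 1
    if k >= i:
        return k + 1          # cut right after the last newline in the run
    if j == n or j - i == 1:
        return j              # run reaches end of text, or is a single character
    return j - 1              # leave the last space to prefix the next piece

def pretokenize(text):
    pieces, i = [], 0
    while i < len(text):
        end = next_boundary(text, i)
        pieces.append(text[i:end])
        i = end
    return pieces

print(pretokenize("Hello world's 12345!\n"))
# ['Hello', ' world', "'s", ' ', '123', '45', '!\n']
```

In this sketch the letter, number, and punctuation branches read each character a bounded number of times and never back up; only the whitespace branch rescans its own run, so its worst case on long runs of spaces is not linear. A stable $O(n)$ bound of the kind the abstract claims would need a more careful whitespace step, and equivalence with any particular Regex-based pretokenizer would have to be checked against that pretokenizer directly.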
Similar Papers
BlockBPE: Parallel BPE Tokenization
Computation and Language
Makes AI understand words much faster.
Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier
Computation and Language
Makes computers understand words better by merging them.
BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
Computation and Language
Helps computers understand all languages better.