BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
By: Sander Land, Catherine Arnett
Potential Business Impact:
Helps language models handle text in every writing system more evenly, without penalizing non-Latin scripts.
Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with partial UTF-8 sequences. Pretokenization, often reliant on complex regular expressions, can also introduce fragility and unexpected edge cases. We propose SCRIPT (Script Category Representation in PreTokenization), a novel encoding scheme that bypasses UTF-8 byte conversion by using initial tokens based on Unicode script and category properties. This approach enables a simple, rule-based pretokenization strategy that respects script boundaries, offering a robust alternative to pretokenization strategies based on regular expressions. We also introduce and validate a constrained BPE merging strategy that enforces character integrity, applicable to both SCRIPT-BPE and byte-based BPE. Our experiments demonstrate that SCRIPT-BPE achieves competitive compression while eliminating encoding-based penalties for non-Latin-script languages.
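The two ideas in the abstract, the UTF-8 byte penalty and pretokenization along script and category boundaries, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`utf8_len`, `is_valid_utf8`, `script_proxy`, `char_key`, `pretokenize`) are hypothetical. Python's standard library does not expose the Unicode Script property, so the sketch assumes the first word of each character's Unicode name as a rough proxy for it. It also ignores details a real scheme would need, such as attaching combining marks to their base character.

```python
import unicodedata
from itertools import groupby


def utf8_len(text):
    """Number of bytes a byte-level tokenizer starts from for this text."""
    return len(text.encode("utf-8"))


def is_valid_utf8(data):
    """True if the byte string decodes cleanly, i.e. holds only whole characters."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def script_proxy(ch):
    """ASSUMPTION: the first word of the Unicode name stands in for the Script property."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def char_key(ch):
    """Grouping key: (approximate script, major general category)."""
    major = unicodedata.category(ch)[0]  # e.g. "L" letter, "N" number, "P" punctuation
    # Only letters and marks get a script; digits, spaces, punctuation share "COMMON".
    script = script_proxy(ch) if major in ("L", "M") else "COMMON"
    return script, major


def pretokenize(text):
    """Split text wherever the (script, category) key changes."""
    return ["".join(group) for _, group in groupby(text, key=char_key)]


if __name__ == "__main__":
    # A Latin letter costs 1 byte, a Devanagari letter costs 3.
    print(utf8_len("a"), utf8_len("म"))
    # The first two bytes of "म" alone are a partial UTF-8 sequence.
    print(is_valid_utf8("म".encode("utf-8")[:2]))
    # Chunks never cross a script or category boundary.
    print(pretokenize("Hello мир 42!"))
```

Because the chunks are built from whole characters rather than bytes, a merge inside a chunk cannot produce a token that holds a partial UTF-8 sequence, which is the failure mode the abstract attributes to byte-based BPE.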
Similar Papers
Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier
Computation and Language
Makes computers understand words better by merging them.
MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies
Computation and Language
Teaches computers to understand word parts better.
Entropy-Driven Pre-Tokenization for Byte-Pair Encoding
Computation and Language
Helps computers understand Chinese words better.