GQ-VAE: A gated quantized VAE for learning variable-length tokens
By: Theo Datta, Kayla Huang, Sham Kakade, and more
While most frontier models still rely on deterministic, frequency-based tokenization algorithms such as byte-pair encoding (BPE), there has been significant recent work on learned neural tokenizers. However, these schemes generally add complexity to the underlying language model and require large architectural changes, making them hard to implement at scale. To overcome these challenges, we propose the gated quantized variational autoencoder (GQ-VAE), a novel architecture that can be pre-trained independently and serve as a drop-in replacement for existing tokenizers. The key innovation of the architecture is that it learns to encode variable-length discrete tokens. GQ-VAE improves compression and language modeling performance over a standard VQ-VAE tokenizer, and approaches the compression rate and language modeling performance of BPE. Interestingly, when BPE is given a smaller vocabulary so that its compression rate matches GQ-VAE's, GQ-VAE yields better downstream language model learning. We conclude with a discussion of several exciting avenues for future work. Code can be found at https://github.com/Theo-Datta-115/gq-vae.
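The abstract does not spell out how the gating works, so the following is only a minimal sketch of one way a gated vector quantizer could produce variable-length discrete token sequences: a per-position gate decides whether a codebook index is emitted, so different inputs yield different numbers of tokens. The module and parameter names (`GatedVectorQuantizer`, `num_codes`, the 0.5 emission threshold) are illustrative assumptions, not the paper's actual implementation; see the repository linked above for the real architecture.

```python
# Illustrative sketch only (not the authors' code): a VQ layer with a learned
# gate that decides, per position, whether to emit a discrete token.
import torch
import torch.nn as nn

class GatedVectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 1024, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # discrete token embeddings
        self.gate = nn.Linear(dim, 1)                 # per-position "emit a token?" score

    def forward(self, h: torch.Tensor):
        # h: (batch, seq_len, dim) encoder states.
        # Standard VQ-VAE step: assign each position to its nearest codebook entry.
        dists = (h.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, L, num_codes)
        idx = dists.argmin(-1)                                           # hard code assignments
        q = self.codebook(idx)
        q = h + (q - h).detach()  # straight-through estimator for gradients

        # Gate: probability of emitting a discrete token at each position.
        emit_prob = torch.sigmoid(self.gate(h)).squeeze(-1)  # (B, L)
        emit = emit_prob > 0.5                                # hard decisions (assumed threshold)

        # Positions where the gate is closed are dropped, so the number of
        # tokens per sequence varies with the input (variable-length tokenization).
        tokens = [idx[b][emit[b]] for b in range(h.size(0))]
        return tokens, q, emit_prob

# Usage example: a batch of 2 sequences of 16 encoder states.
quantizer = GatedVectorQuantizer()
tokens, quantized, emit_prob = quantizer(torch.randn(2, 16, 256))
print([t.shape for t in tokens])  # token counts differ across sequences
```

In a sketch like this, compression comes from training the gate to stay closed most of the time (e.g., via a penalty on the expected emission rate), while reconstruction quality pushes it to open where the input is hard to summarize; the abstract does not specify which objective the authors actually use.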