Parallel Token Prediction for Language Models
By: Felix Draxler, Justus Will, Farrin Marouf Sofian, and more
Potential Business Impact:
Makes computers write sentences much faster.
We propose Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models. PTP jointly predicts multiple dependent tokens in a single transformer call by incorporating the sampling procedure into the model. This reduces the latency bottleneck of autoregressive decoding and avoids the restrictive independence assumptions common in existing multi-token prediction methods. We prove that PTP can represent arbitrary autoregressive sequence distributions. PTP is trained either by distilling an existing model or by inverse autoregressive training without a teacher. Experimentally, we achieve state-of-the-art speculative decoding performance on Vicuna-7B, accepting over four tokens per step on Spec-Bench. The universality of our framework indicates that parallel generation of long sequences is feasible without loss of modeling power.
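To make the core idea concrete, here is a minimal, hedged sketch of what "incorporating the sampling procedure into the model" can look like: per-position random noise is drawn up front and passed into a single model call, which deterministically maps context plus noise to a block of dependent tokens, after which a base model verifies the block as in speculative decoding. This is not the authors' implementation; `ptp_draft`, `verify_with_base_model`, the toy linear "model", and the greedy acceptance criterion are all hypothetical stand-ins chosen for illustration under the assumptions stated in the comments.

```python
"""Hedged sketch of PTP-style block generation (toy, NumPy-only).

Assumptions (not from the paper): a toy linear map stands in for the
transformer, and acceptance is a greedy match against a stand-in base
model rather than the paper's criterion. The point illustrated is the
interface: sampling noise is an *input* to one model call, so several
dependent tokens come out of a single forward pass.
"""
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 32          # toy vocabulary size
BLOCK = 4           # tokens drafted per call

def ptp_draft(context, noise, weights):
    """One call: map (context, per-position noise) -> BLOCK dependent tokens.

    In a real PTP model a transformer would compute this dependence inside
    one parallel forward pass; the Python loop below only stands in for
    that internal computation.
    """
    h = np.zeros(VOCAB)
    for t in context:
        h += weights[t]                      # crude context summary
    tokens = []
    for u in noise:
        logits = h + weights[tokens[-1]] if tokens else h
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # Inverse-CDF sampling: the pre-drawn noise u deterministically
        # selects the token, which is how the sampling step gets folded
        # into the model itself.
        tokens.append(min(int(np.searchsorted(np.cumsum(probs), u)), VOCAB - 1))
    return tokens

def verify_with_base_model(context, drafted, base_step):
    """Speculative-decoding-style check: accept the longest prefix that the
    base autoregressive model would also have produced (greedy toy rule)."""
    accepted = []
    for tok in drafted:
        if base_step(context + accepted) != tok:
            break
        accepted.append(tok)
    return accepted

# Toy usage: draft a block in "one call", then verify it.
weights = rng.normal(size=(VOCAB, VOCAB))
context = [1, 2, 3]
noise = rng.uniform(size=BLOCK)              # sampling randomness as input
drafted = ptp_draft(context, noise, weights)
base_step = lambda ctx: ptp_draft(ctx, [0.5], weights)[0]   # stand-in base model
print(drafted, verify_with_base_model(context, drafted, base_step))
```

In this reading, the number of accepted tokens per verification step is what the abstract's "over four tokens per step on Spec-Bench" measures for the real model.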
Similar Papers
LoPT: Lossless Parallel Tokenization Acceleration for Long Context Inference of Large Language Model
Computation and Language
Makes AI understand long texts much faster.
FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction
Machine Learning (CS)
Makes AI write much faster without mistakes.
Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference
CV and Pattern Recognition
Makes AI understand pictures much faster.