Speculative Sampling via Exponential Races
By: Szymon Kobus, Deniz Gündüz
Potential Business Impact:
Makes AI write faster by guessing ahead.
Speculative decoding accelerates large language model inference using a smaller draft model. In this paper, we establish a surprising connection between speculative decoding and channel simulation, which aims at simulating a noisy channel using as few bits as possible. This connection allows us to provide an information-theoretic analysis of the speed-up that can be achieved by speculative decoding. Leveraging this link, we derive an explicit relation between generation speed-up and the number of tokens $k$ generated by the draft model for large $k$, which serves as an upper bound for all $k$. We also propose a novel speculative decoding method via exponential races (ERSD) that matches state-of-the-art performance.
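To make the draft-then-verify idea concrete, here is a minimal sketch of one accept/reject step of *standard* speculative sampling (the baseline the paper builds on), not the paper's ERSD variant: the draft model proposes a token from its distribution `q_draft`, the target model accepts it with probability `min(1, p/q)`, and on rejection a replacement is drawn from the residual distribution so the output exactly matches the target model. All names here are illustrative.

```python
import numpy as np

def speculative_step(p_target, q_draft, draft_token, rng):
    """One accept/reject step of standard speculative sampling.

    p_target, q_draft: next-token probability vectors of the target
    and draft models over the same vocabulary.
    draft_token: token index that was sampled from q_draft.
    Returns (token, accepted). The marginal distribution of the
    returned token is exactly p_target.
    """
    # Accept the draft token with probability min(1, p/q).
    accept_prob = min(1.0, p_target[draft_token] / q_draft[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    # Rejected: resample from the residual distribution
    # max(p - q, 0), renormalized. This correction is what makes
    # the overall output distribution equal to p_target.
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p_target), p=residual)), False
```

In the full algorithm this step is applied to each of the $k$ drafted tokens in turn, stopping at the first rejection; the expected number of accepted tokens per verification pass is what determines the speed-up analyzed in the paper.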
Similar Papers
Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks
Computation and Language
Makes AI write much faster by checking many words at once.
The Disparate Impacts of Speculative Decoding
Computation and Language
Makes AI answer questions faster, fairly.
Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism
Distributed, Parallel, and Cluster Computing
Makes AI talk much faster by guessing ahead.