TapOut: A Bandit-Based Approach to Dynamic Speculative Decoding
By: Aditya Sridhar, Nish Sinnadurai, Sean Lie, and more
Potential Business Impact:
Makes AI talk faster by guessing words smartly.
Speculative decoding accelerates LLMs by using a lightweight draft model to generate tokens autoregressively before verifying them in parallel with a larger target model. However, determining the optimal number of tokens to draft remains a key challenge limiting the approach's effectiveness. Dynamic speculative decoding aims to intelligently decide how many tokens to draft to achieve maximum speedups. Existing methods often rely on hand-tuned, sensitive thresholds (e.g., token entropy), which are costly to set and generalize poorly across models and domains. We propose TapOut, an online, training-free, plug-and-play algorithm for dynamic speculation policy selection using multi-armed bandits. Our approach employs a meta-algorithm that selects among multiple parameter-free dynamic speculation strategies based on past reward and exploration. We conduct extensive experiments across diverse model pairs and datasets, showing that TapOut achieves competitive or superior speedups compared to well-established dynamic speculation baselines without any hyperparameter tuning.
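To make the bandit idea concrete, here is a minimal sketch of how a meta-algorithm could pick among parameter-free speculation-stopping policies based on past reward and exploration. It assumes a UCB1-style bandit and uses accepted-tokens-per-draft as a stand-in reward; the class name `BanditPolicySelector`, the toy acceptance rates, and the simulated loop are illustrative assumptions, not the paper's implementation or its specific bandit algorithm.

```python
import math
import random

class BanditPolicySelector:
    """UCB1-style selector over candidate speculation policies (arms).

    Each arm is a parameter-free rule for deciding when to stop drafting.
    Reward is a proxy for speedup, e.g. the fraction of drafted tokens
    that the target model accepts.
    """

    def __init__(self, num_arms):
        self.counts = [0] * num_arms     # times each policy was selected
        self.values = [0.0] * num_arms   # running mean reward per policy
        self.total = 0                   # total number of selections

    def select(self):
        # Play each arm once before applying the UCB rule.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        # UCB1: exploit mean reward, plus a bonus for under-sampled arms.
        return max(
            range(len(self.counts)),
            key=lambda a: self.values[a]
            + math.sqrt(2.0 * math.log(self.total) / self.counts[a]),
        )

    def update(self, arm, reward):
        # Incremental update of the chosen policy's mean reward.
        self.counts[arm] += 1
        self.total += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


if __name__ == "__main__":
    # Toy simulation: three hypothetical stopping policies with different
    # (unknown) expected acceptance rates standing in for real decoding.
    true_rates = [0.55, 0.70, 0.60]
    selector = BanditPolicySelector(num_arms=len(true_rates))
    for _ in range(1000):
        arm = selector.select()
        reward = 1.0 if random.random() < true_rates[arm] else 0.0
        selector.update(arm, reward)
    print("selection counts per policy:", selector.counts)
```

In a real decoding loop, the simulated reward would be replaced by a measured quantity from each draft-then-verify step (e.g., how many drafted tokens the target model accepted), so the selector gradually concentrates on whichever stopping policy yields the best speedup for the current model pair and domain.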
Similar Papers
Not-a-Bandit: Provably No-Regret Drafter Selection in Speculative Decoding for LLMs
Machine Learning (CS)
Makes AI write faster and smarter.
Confidence-Modulated Speculative Decoding for Large Language Models
Computation and Language
Makes AI write faster and smarter.
Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Minimization
Machine Learning (CS)
Makes AI write faster without losing quality.