The Disparate Impacts of Speculative Decoding
By: Jameson Sandler, Ahmet Üstün, Marco Romanelli, and more
Potential Business Impact:
Makes AI answer questions faster and more fairly.
Speculative decoding, whereby inference is probabilistically accelerated by a smaller, cheaper "drafter" model, has become a standard technique for reducing the decoding time of large language models. This paper analyzes speculative decoding through the lens of its potential disparate speed-up rates across tasks. Crucially, it shows that the speed-up gained from speculative decoding is not uniformly distributed across tasks, consistently diminishing for under-fit, and often underrepresented, tasks. To better understand this phenomenon, the authors derive an analysis that quantifies this observed "unfairness" and draws attention to the factors that cause such disparate speed-ups to emerge. Guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, reporting an average 12% improvement in its fairness metric.
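To make the mechanism behind these disparate speed-ups concrete, below is a minimal sketch of the standard speculative sampling loop: the drafter proposes up to `k` tokens, the target accepts each with probability `min(1, p/q)`, and a rejection triggers resampling from the residual distribution `max(0, p - q)`. All names and the toy setup are illustrative (real systems condition both models on the prefix; here each position uses fixed, token-independent distributions), but the sketch shows why the speed-up tracks the drafter–target agreement on a given task: the acceptance rate is exactly the overlap between the two distributions.

```python
import random

def sample(dist, rng):
    """Sample a token index from a categorical distribution (list of probs)."""
    r, acc = rng.random(), 0.0
    for tok, prob in enumerate(dist):
        acc += prob
        if r < acc:
            return tok
    return len(dist) - 1

def speculative_round(p, q, k, rng):
    """One speculative round: drafter q proposes up to k tokens, target p
    verifies each. Returns (emitted tokens, accepted count, drafted count)."""
    out, accepted, drafted = [], 0, 0
    for _ in range(k):
        drafted += 1
        t = sample(q, rng)                           # drafter proposes
        if rng.random() < min(1.0, p[t] / q[t]):     # target verifies
            out.append(t)
            accepted += 1
        else:
            # On rejection, resample from the residual max(0, p - q),
            # which keeps the output distribution exactly equal to p.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            z = sum(residual)
            out.append(sample([r / z for r in residual], rng))
            break
    else:
        out.append(sample(p, rng))  # bonus token when every draft is accepted
    return out, accepted, drafted

def acceptance_rate(p, q, k=4, rounds=2000, seed=0):
    """Empirical fraction of drafted tokens the target accepts."""
    rng = random.Random(seed)
    acc = drafted = 0
    for _ in range(rounds):
        _, a, d = speculative_round(p, q, k, rng)
        acc += a
        drafted += d
    return acc / drafted

# When drafter and target agree exactly, every draft is accepted.
uniform = [0.25, 0.25, 0.25, 0.25]
print(acceptance_rate(uniform, uniform))   # 1.0

# On a "task" where they disagree, acceptance (and thus speed-up) drops:
# the theoretical rate is sum(min(p_i, q_i)) = 0.55 here.
skewed = [0.70, 0.10, 0.10, 0.10]
print(acceptance_rate(skewed, uniform))
```

Fewer accepted drafts mean more target-model forward passes per emitted token, which is the mechanism the paper identifies behind task-dependent speed-up disparities.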
Similar Papers
Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling
Computation and Language
Makes AI think faster by skipping repeated steps.
Confidence-Modulated Speculative Decoding for Large Language Models
Computation and Language
Makes AI write faster and smarter.
Automatic Task Detection and Heterogeneous LLM Speculative Decoding
Computation and Language
Makes AI write faster and better for different jobs.