The Disparate Impacts of Speculative Decoding

Published: October 2, 2025 | arXiv ID: 2510.02128v1

By: Jameson Sandler, Ahmet Üstün, Marco Romanelli and more

Potential Business Impact:

Makes AI answer questions faster, fairly.

Business Areas:
Predictive Analytics Artificial Intelligence, Data and Analytics, Software

Speculative decoding, in which a smaller, cheaper "drafter" model proposes tokens that the larger model then verifies, has become a standard technique for reducing the decoding time of large language models. This paper analyzes speculative decoding through the lens of its potentially disparate speed-up rates across tasks. Crucially, it shows that the speed-up gained from speculative decoding is not uniformly distributed across tasks: it consistently diminishes for under-fit, and often underrepresented, tasks. To better understand this phenomenon, the paper derives an analysis that quantifies the observed "unfairness" and draws attention to the factors that cause such disparate speed-ups to emerge. Guided by these insights, it then proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, reporting on average a 12% improvement in its fairness metric.
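
As a rough illustration of why speed-up can differ across tasks, the sketch below uses the simplified model common in the speculative decoding literature: each drafted token is assumed to be accepted independently with probability `alpha`, the drafter proposes `gamma` tokens per step, and one drafter call costs a fraction `cost_ratio` of one target-model call. This is not the paper's own analysis or fairness metric. The task names, acceptance rates, and cost ratio are made-up values chosen only to show the trend.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model verification pass.

    Assumes (simplification) that each drafted token is accepted
    independently with probability alpha, and gamma tokens are drafted.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus one token from the target model.
        return float(gamma + 1)
    # Closed form of the geometric sum 1 + alpha + ... + alpha**gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected speed-up over plain autoregressive decoding.

    cost_ratio is the assumed cost of one drafter call relative to one
    target-model call; each step costs gamma drafter calls plus one
    target-model call.
    """
    step_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


# Hypothetical per-task acceptance rates (illustrative, not from the paper).
hypothetical_tasks = {"well_fit_task": 0.8, "under_fit_task": 0.4}

for task, alpha in hypothetical_tasks.items():
    speedup = expected_speedup(alpha, gamma=4, cost_ratio=0.05)
    print(f"{task}: acceptance={alpha:.2f}, expected speed-up={speedup:.2f}x")
```

Under these assumptions, a task on which the drafter agrees with the target model less often gets a smaller expected speed-up from the same model pair, which is the kind of disparity the abstract describes.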

Country of Origin
🇺🇸 United States

Page Count
17 pages

Category
Computer Science:
Computation and Language