AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference
By: Kuan-Wei Lu, Ding-Yong Hong, Pangfeng Liu
Potential Business Impact:
Makes AI talk faster without losing its smarts.
Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft model to predict candidate tokens, which are then verified by a larger target model. However, existing approaches often require additional training, extensive hyperparameter tuning, or prior analysis of models and tasks before deployment. In this paper, we propose Adaptive Speculative Decoding (AdaSD), a hyperparameter-free decoding scheme that dynamically adjusts generation length and acceptance criteria during inference. AdaSD introduces two adaptive thresholds: one to determine when to stop candidate token generation and another to decide token acceptance, both updated in real time based on token entropy and Jensen-Shannon distance. This approach eliminates the need for pre-analysis or fine-tuning and is compatible with off-the-shelf models. Experiments on benchmark datasets demonstrate that AdaSD achieves up to 49% speedup over standard speculative decoding while limiting accuracy degradation to under 2%, making it a practical solution for efficient and adaptive LLM inference.
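To make the mechanism concrete, here is a minimal Python sketch of one speculate-then-verify round in the spirit of AdaSD: the draft model keeps drafting while its token entropy stays below an adaptive stop threshold, and the target model accepts candidates while the Jensen-Shannon distance between the two distributions stays below an adaptive acceptance threshold. The `draft_model`/`target_model` interfaces, the greedy drafting, the exponential-moving-average threshold updates, and the `max_draft` guard are all illustrative assumptions; the abstract does not give the paper's exact update rules.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a token distribution (natural log)."""
    p = np.asarray(p, dtype=np.float64)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def js_distance(p, q):
    """Jensen-Shannon distance (square root of JS divergence)."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    m = 0.5 * (p + q)
    def kl(a, b):
        nz = a > 0
        return float(np.sum(a[nz] * np.log(a[nz] / b[nz])))
    return float(np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m)))

def adasd_step(draft_model, target_model, context,
               stop_thresh, accept_thresh, momentum=0.9, max_draft=8):
    """One speculate-then-verify round with adaptive thresholds.

    `draft_model(context)` returns a token distribution for the next
    position; `target_model(context, tokens)` returns the target's
    distributions for the drafted positions. Both interfaces and the
    running-mean threshold updates are illustrative assumptions.
    """
    # Draft phase: keep sampling while the draft model is confident,
    # i.e. while its token entropy stays below the stop threshold.
    candidates, draft_dists = [], []
    ctx = list(context)
    while len(candidates) < max_draft:
        p = draft_model(ctx)                       # draft distribution
        h = entropy(p)
        stop_thresh = momentum * stop_thresh + (1 - momentum) * h
        if h > stop_thresh and candidates:         # uncertain: stop drafting
            break
        tok = int(np.argmax(p))                    # greedy draft token
        candidates.append(tok)
        draft_dists.append(p)
        ctx.append(tok)

    # Verify phase: the target model scores all candidates in one pass;
    # accept tokens while draft and target distributions remain close.
    target_dists = target_model(context, candidates)
    accepted = []
    for tok, p, q in zip(candidates, draft_dists, target_dists):
        d = js_distance(p, q)
        accept_thresh = momentum * accept_thresh + (1 - momentum) * d
        if d > accept_thresh:                      # too far apart: reject
            break
        accepted.append(tok)

    return accepted, stop_thresh, accept_thresh
```

In a full decoder this round repeats: accepted tokens are appended to the context, and on a rejection the next token is drawn from the target model's distribution, as in standard speculative decoding.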
Similar Papers
Speculative Decoding in Decentralized LLM Inference: Turning Communication Latency into Computation Throughput
Distributed, Parallel, and Cluster Computing
Makes AI talk faster when shared.
DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving
Machine Learning (CS)
Makes AI talk faster on many devices.