Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding
By: Yuxuan Zhou, Fei Huang, Heng Li, and more
Potential Business Impact:
Makes AI write faster without making mistakes.
Verification is a key bottleneck in improving inference speed while maintaining distribution fidelity in Speculative Decoding. Recent work has shown that sequence-level verification accepts more tokens than token-wise verification. However, existing solutions often rely on surrogate approximations or operate on only partial information, and thus struggle with the intractability of the joint sequence distribution. In this work, we propose Hierarchical Speculative Decoding (HSD), a provably lossless verification method that significantly boosts the expected number of accepted tokens and overcomes joint intractability by balancing excess and deficient probability mass across accessible branches. Our extensive large-scale experiments demonstrate that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks. Moreover, its strong explainability and generality make it readily integrable into a wide range of speculative decoding frameworks. Notably, integrating HSD into EAGLE-3 yields a performance gain of over 12%, establishing state-of-the-art decoding efficiency without compromising distribution fidelity. Code is available at https://github.com/ZhouYuxuanYX/Hierarchical-Speculative-Decoding.
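For context, below is a minimal Python sketch of the standard token-wise verification rule that HSD improves upon: each drafted token is accepted with probability min(1, p_target/p_draft), and the first rejected position is resampled from the normalized residual distribution, which keeps the output exactly matching the target model. This is the well-known baseline, not the paper's HSD algorithm (which operates at the sequence level and balances excess and deficient mass across branches); all function and variable names here are illustrative.

```python
import numpy as np

def verify_token_wise(draft_tokens, draft_probs, target_probs, rng):
    """Token-wise lossless verification (baseline, not HSD).

    draft_tokens: list of drafted token ids
    draft_probs:  per-step draft-model distributions q_t over the vocabulary
    target_probs: per-step target-model distributions p_t over the vocabulary
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        # Accept the drafted token with probability min(1, p/q).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            # On rejection, resample from the normalized residual
            # max(p - q, 0), which preserves the target distribution.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            break  # stop at the first rejection
    return accepted

# Toy example: two drafted tokens over a 4-token vocabulary.
rng = np.random.default_rng(0)
q = [np.array([0.7, 0.1, 0.1, 0.1]), np.array([0.25, 0.25, 0.25, 0.25])]
p = [np.array([0.4, 0.3, 0.2, 0.1]), np.array([0.1, 0.6, 0.2, 0.1])]
print(verify_token_wise([0, 1], q, p, rng))
```

Sequence-level methods such as HSD aim to accept longer prefixes than this per-token rule while remaining lossless; see the linked repository for the actual implementation.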
Similar Papers
Fast Inference via Hierarchical Speculative Decoding
Machine Learning (CS)
Makes AI write stories much faster.
Hierarchical Verification of Speculative Beams for Accelerating LLM Inference
Computation and Language
Makes AI write faster and use less power.