HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding
By: Qitan Lv , Tianyu Liu , Wen Wu and more
Speculative decoding (SD) has emerged as a promising approach to accelerate LLM inference without sacrificing output quality. Existing SD methods tailored for video-LLMs primarily focus on pruning redundant visual tokens to mitigate the computational burden of massive visual inputs. However, existing methods do not achieve inference acceleration comparable to text-only LLMs. We observe from extensive experiments that this phenomenon mainly stems from two limitations: (i) their pruning strategies inadequately preserve visual semantic tokens, degrading draft quality and acceptance rates; (ii) even with aggressive pruning (e.g., 90% visual tokens removed), the draft model's remaining inference cost limits overall speedup. To address these limitations, we propose HIPPO, a general holistic-aware parallel speculative decoding framework. Specifically, HIPPO proposes (i) a semantic-aware token preservation method, which fuses global attention scores with local visual semantics to retain semantic information at high pruning ratios; (ii) a video parallel SD algorithm that decouples and overlaps draft generation and target verification phases. Experiments on four video-LLMs across six benchmarks demonstrate HIPPO's effectiveness, yielding up to 3.51x speedup compared to vanilla auto-regressive decoding.
Similar Papers
Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Computation and Language
Makes AI write much faster without retraining.
Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding
Computation and Language
Makes AI talk faster by guessing words.
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
CV and Pattern Recognition
Makes videos understandable much faster.