HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding

Published: January 13, 2026 | arXiv ID: 2601.08273v1

By: Qitan Lv, Tianyu Liu, Wen Wu and more

Speculative decoding (SD) has emerged as a promising approach to accelerating LLM inference without sacrificing output quality. Existing SD methods tailored to video-LLMs focus primarily on pruning redundant visual tokens to reduce the computational burden of massive visual inputs. However, these methods do not achieve inference acceleration comparable to that achieved on text-only LLMs. We observe from extensive experiments that this gap stems mainly from two limitations: (i) their pruning strategies fail to preserve semantically important visual tokens, which degrades draft quality and acceptance rates; (ii) even with aggressive pruning (e.g., 90% of visual tokens removed), the draft model's remaining inference cost limits the overall speedup. To address these limitations, we propose HIPPO, a general holistic-aware parallel speculative decoding framework. Specifically, HIPPO introduces (i) a semantic-aware token preservation method, which fuses global attention scores with local visual semantics to retain semantic information at high pruning ratios; and (ii) a video parallel SD algorithm that decouples and overlaps the draft generation and target verification phases. Experiments on four video-LLMs across six benchmarks demonstrate HIPPO's effectiveness, yielding up to a 3.51x speedup over vanilla auto-regressive decoding.
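For readers new to the draft-then-verify idea the abstract builds on, here is a minimal Python sketch of generic greedy speculative decoding. It is not HIPPO's parallel algorithm and includes no visual token pruning. The names `toy_target`, `toy_draft`, `speculative_decode`, and the draft length `k` are hypothetical, chosen only for illustration, and the "models" are plain functions from a token prefix to the next token.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def toy_target(seq: List[int]) -> int:
    """Toy 'large' model (assumption: any deterministic function works here)."""
    return (31 * sum(seq) + len(seq)) % 50


def toy_draft(seq: List[int]) -> int:
    """Toy 'small' model: agrees with the target except at some positions."""
    return toy_target(seq) if len(seq) % 3 else 0


def autoregressive_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy draft-then-verify loop; produces the same tokens as the baseline."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # Draft phase: the cheap model proposes up to k tokens.
        proposal = list(seq)
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(proposal))
        # Verify phase: a real system would score all drafted positions in one
        # batched target forward pass; this sketch loops for clarity.
        for i in range(len(seq), len(proposal)):
            expected = target(proposal[:i])
            if proposal[i] != expected:
                # First mismatch: keep the target's token, discard later drafts.
                proposal = proposal[:i] + [expected]
                break
        seq = proposal
    return seq


if __name__ == "__main__":
    fast = speculative_decode(toy_target, toy_draft, [1, 2, 3], n_new=12)
    slow = autoregressive_decode(toy_target, [1, 2, 3], n_new=12)
    print(fast == slow)
```

In this sketch the draft and verify phases run strictly one after the other. The abstract's second contribution is about overlapping those two phases, which this sequential loop does not attempt to model.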

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies

Computation and Language

Makes AI write much faster without retraining.

31 Jan 2025 1

90%

Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding

Computation and Language

Makes AI talk faster by guessing words.

8 Feb 2025 1

89%

SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning

CV and Pattern Recognition

Makes videos understandable much faster.

22 Aug 2025 1

