D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition

Published: October 9, 2025 | arXiv ID: 2510.08818v1

By: Yiyang Huang, Yizhou Wang, Yun Fu

Potential Business Impact:

Helps computers understand long videos better.

Business Areas:

Image Recognition, Data and Analytics, Software

Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires processing dense and temporally extended visual inputs that exceed the capacity of image-based models. This paper identifies the perception bottleneck and token overload as key challenges in extending image-based VLMs to the video domain. To address these issues, we propose D-CoDe, a training-free adaptation framework that incorporates dynamic compression and question decomposition. Specifically, dynamic compression alleviates the perception bottleneck through adaptive selection of representative frames and content-aware aggregation of spatial tokens, thereby reducing redundancy while preserving informative content. In parallel, question decomposition mitigates token overload by reformulating the original query into sub-questions, guiding the model to focus on distinct aspects of the video and enabling more comprehensive understanding. Experiments demonstrate that D-CoDe effectively improves video understanding across various benchmarks. Furthermore, strong performance on the challenging long-video benchmark highlights the potential of D-CoDe in handling complex video-language tasks. Code is available at https://github.com/hukcc/D-CoDe.
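
The abstract describes dynamic compression only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation (see the linked repository for that). It assumes per-frame and per-token feature vectors are already available as plain Python lists, and it swaps in a simple cosine-similarity threshold for both steps. The function names `select_frames` and `merge_tokens` and the `threshold` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_frames(frame_feats, threshold=0.9):
    """Hypothetical frame selection: keep a frame only if it is
    sufficiently different from the most recently kept frame."""
    kept = []
    for i, feat in enumerate(frame_feats):
        if not kept or cosine(frame_feats[kept[-1]], feat) < threshold:
            kept.append(i)
    return kept


def merge_tokens(tokens, threshold=0.9):
    """Hypothetical token aggregation: average each run of consecutive
    tokens that are similar to the first token of the run."""
    groups = []
    for tok in tokens:
        if groups and cosine(groups[-1][0], tok) >= threshold:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    # Element-wise mean of every group yields one token per group.
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]


if __name__ == "__main__":
    frames = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
    print(select_frames(frames))          # indices of the kept frames
    print(merge_tokens(frames))           # merged token vectors
```

In this sketch, near-duplicate frames are dropped and redundant neighbouring tokens are averaged, which shortens the visual token sequence before it reaches the language model. Question decomposition is not shown because it depends on prompting a language model.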

LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit

CV and Pattern Recognition

Makes AI understand pictures and words better, faster.

13 Aug 2025

89%

Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models

CV and Pattern Recognition

Makes computers understand pictures much faster.

8 Aug 2025

89%

Country of Origin

🇺🇸 United States

Repos / Data Links

github.com

Page Count

14 pages
