Breaking the Encoder Barrier for Seamless Video-Language Understanding
By: Handong Li, Yiyuan Zhang, Longteng Guo, and more
Potential Business Impact:
Lets computers understand videos much faster and at a fraction of the computing cost, enabling real-time video analysis.
Most Video-Large Language Models (Video-LLMs) adopt an encoder-decoder framework, where a vision encoder extracts frame-wise features for processing by a language model. However, this approach incurs high computational costs, introduces resolution biases, and struggles to capture fine-grained multimodal interactions. To overcome these limitations, we propose ELVA, an encoder-free Video-LLM that directly models nuanced video-language interactions without relying on a vision encoder. ELVA employs token merging to construct a bottom-up hierarchical representation and incorporates a video guidance supervisor for direct spatiotemporal representation learning. Additionally, a hybrid-resolution mechanism strategically integrates high- and low-resolution frames as inputs to achieve an optimal balance between performance and efficiency. With only 7M publicly available video-text pairs, ELVA achieves performance on par with encoder-based Video-LLMs while reducing FLOPs by up to 95% and inference latency by 92%, offering a scalable and efficient solution for real-time video understanding.
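The abstract names token merging and hybrid-resolution inputs but not their exact form. The PyTorch sketch below illustrates one plausible reading: adjacent patch tokens are averaged pairwise to build a progressively coarser, bottom-up token pyramid, and only every k-th frame is kept at full resolution while the rest are downsampled. The pairwise-average merge rule, the three-level pyramid, the `hi_every` keyframe stride, and the 2x downsampling factor are all illustrative assumptions, not ELVA's published design; the video guidance supervisor is not sketched here.

```python
# Hypothetical sketch of the two efficiency mechanisms described in the
# abstract. All function names and hyperparameters are assumptions.
import torch

def merge_level(tokens: torch.Tensor) -> torch.Tensor:
    """Average adjacent token pairs, halving sequence length.
    tokens: (batch, n_tokens, dim), n_tokens assumed even."""
    b, n, d = tokens.shape
    return tokens.view(b, n // 2, 2, d).mean(dim=2)

def build_hierarchy(patch_tokens: torch.Tensor, num_levels: int = 3):
    """Bottom-up pyramid: each level is a coarser merge of the one below."""
    levels = [patch_tokens]
    for _ in range(num_levels - 1):
        levels.append(merge_level(levels[-1]))
    return levels

def hybrid_resolution_frames(frames: torch.Tensor, hi_every: int = 8):
    """Keep every `hi_every`-th frame at full resolution; downsample the
    rest 2x to cut FLOPs. frames: (time, channels, height, width)."""
    out = []
    for i in range(frames.shape[0]):
        if i % hi_every == 0:
            out.append(frames[i])  # high-resolution keyframe
        else:
            lo = torch.nn.functional.interpolate(
                frames[i : i + 1], scale_factor=0.5,
                mode="bilinear", align_corners=False)
            out.append(lo[0])      # low-resolution filler frame
    return out  # ragged list of mixed-resolution frames

# Usage with dummy data: a 16-frame clip and 256 patch tokens per frame.
frames = torch.randn(16, 3, 224, 224)
tokens = torch.randn(1, 256, 768)
pyramid = build_hierarchy(tokens)  # token counts: 256 -> 128 -> 64
print([lvl.shape[1] for lvl in pyramid], len(hybrid_resolution_frames(frames)))
```

Both tricks shrink the token budget fed to the language model, which is where the reported FLOPs and latency savings would come from: coarser levels summarize the video cheaply, while full resolution is spent only on a few keyframes.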
Similar Papers
Unifying Specialized Visual Encoders for Video Language Models
CV and Pattern Recognition
Lets computers understand videos much better.
Empowering Agentic Video Analytics Systems with Video Language Models
CV and Pattern Recognition
Lets computers understand very long videos.
Kwai Keye-VL 1.5 Technical Report
CV and Pattern Recognition
Helps computers understand videos better and longer.