Video Finetuning Improves Reasoning Between Frames
By: Ruiqi Yang, Tian Yun, Zihan Wang, and more
Potential Business Impact:
Helps computers understand video stories better.
Multimodal large language models (LLMs) have made rapid progress in visual understanding, yet their extension from images to videos often reduces to a naive concatenation of frame tokens. In this work, we investigate what video finetuning brings to multimodal LLMs. We propose Visual Chain-of-Thought (vCoT), an explicit reasoning process that generates transitional event descriptions between consecutive frames. Using vCoT, we systematically compare image-only LVLMs with their video-finetuned counterparts, both with and without access to these transitional cues. Our experiments show that vCoT significantly improves the performance of image-only models on long-form video question answering, while yielding only marginal gains for video-finetuned models. This suggests that the latter already capture frame-to-frame transitions implicitly. Moreover, we find that video models transfer this temporal reasoning ability to purely static settings, outperforming image-only baselines on relational visual reasoning tasks.
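The abstract describes vCoT as prompting a model for a transitional event description between each pair of consecutive frames, then supplying those descriptions as extra cues when answering the video question. The sketch below illustrates that flow under stated assumptions: `lvlm` is a placeholder for any image-capable model callable, and the prompt wording and helper names are hypothetical, not the paper's actual implementation.

```python
# Minimal sketch of a vCoT-style pipeline (illustrative, not the authors' code).
# Assumes `lvlm(prompt, images)` is any callable that takes a text prompt plus
# a list of frame images and returns generated text.
from typing import Callable, List, Sequence

def describe_transitions(
    lvlm: Callable[[str, Sequence[object]], str],
    frames: Sequence[object],
) -> List[str]:
    """Ask the model for a transitional event description between consecutive frames."""
    transitions = []
    for prev_frame, next_frame in zip(frames, frames[1:]):
        prompt = (
            "Describe, in one sentence, the event that happens between "
            "these two consecutive video frames."
        )
        transitions.append(lvlm(prompt, [prev_frame, next_frame]))
    return transitions

def answer_with_vcot(
    lvlm: Callable[[str, Sequence[object]], str],
    frames: Sequence[object],
    question: str,
) -> str:
    """Answer a video question using the frames plus explicit transitional cues."""
    transitions = describe_transitions(lvlm, frames)
    cue_block = "\n".join(
        f"Between frame {i} and frame {i + 1}: {t}"
        for i, t in enumerate(transitions)
    )
    prompt = (
        f"Transitional events:\n{cue_block}\n\n"
        f"Question: {question}\nAnswer concisely."
    )
    return lvlm(prompt, list(frames))
```

In this sketch, an image-only model receives the frame-to-frame transitions as text it would otherwise have to infer, which is the comparison the paper uses to probe whether video-finetuned models already encode such transitions implicitly.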
Similar Papers
Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning
CV and Pattern Recognition
Helps computers understand videos by looking at frames.
When Thinking Drifts: Evidential Grounding for Robust Video Reasoning
CV and Pattern Recognition
Helps AI "see" and "think" better with videos.
VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning
CV and Pattern Recognition
Helps computers understand videos by watching them.