Limits and Gains of Test-Time Scaling in Vision-Language Reasoning
By: Mohammadjavad Ahmadpour , Amirmahdi Meighani , Payam Taebi and more
Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference, yet its application to multimodal systems such as Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic empirical study of inference time reasoning methods applied across both open-source and closed-source VLMs on different benchmarks. Our results reveal that while closed-source models consistently benefit from structured reasoning and iterative Self-Refinement, open-source VLMs show inconsistent behavior: external verification provides the most reliable gains, whereas iterative refinement often degrades performance. We further find that the effectiveness of TTS is dataset-dependent, yielding clear improvements on multi-step reasoning tasks but offering only limited gains on perception-focused benchmarks. These findings demonstrate that TTS is not a universal solution and must be tailored to both model capabilities and task characteristics, motivating future work on adaptive TTS strategies and multimodal reward models.
Similar Papers
The Art of Scaling Test-Time Compute for Large Language Models
Computation and Language
Makes AI think better by changing how it works.
Test-Time Scaling of Reasoning Models for Machine Translation
Computation and Language
Makes computer translators better at fixing their own mistakes.
VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
CV and Pattern Recognition
Makes AI better at understanding videos by looking closer.