ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better
By: Yuan Zhang, Ming Lu, Junwen Pan and more
Potential Business Impact:
Makes AI think smarter and faster with pictures.
Recent advances in multimodal reasoning models have demonstrated impressive capabilities across text and vision. However, even leading models exhibit redundant self-reflection when generating lengthy reasoning chains. While training-free CoT compression methods have emerged in the LLM domain, they rely on static visual references and thus provide limited gains for multimodal reasoning. We therefore propose ChainV, a framework that dynamically integrates visual hints into the reasoning process, making multimodal reasoning shorter and better. Specifically, ChainV first performs a coarse visual patch selection based on the previous reasoning step, then refines it by identifying the most representative atomic visual hint according to the averaged attention intensity. Additionally, ChainV introduces a consistency-based evaluation mechanism to assess the reliability of the chosen hint, guiding the model to adaptively adjust its level of self-reflection. Finally, the pixel coordinates of the selected visual hint and its reliability are incorporated into the model's thinking through a Bernoulli stochastic process. Experiments indicate that our method significantly improves reasoning accuracy and efficiency, especially on math-intensive benchmarks where visual hints are crucial for multi-step symbolic reasoning. For example, ChainV achieves a $2.3\%$ improvement on MathVista with MIMO-VL-RL, while reducing inference latency by $51.4\%$ and shortening output token length by $24.5\%$.
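The selection pipeline described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the attention source, the consistency measure, or the hint format, so the function names, the top-k coarse step, the vote-based reliability score, the `p_inject` parameter, and the `<hint ...>` string are all assumptions made for illustration.

```python
import random


def mean_attention(attn):
    """Average attention per visual patch over the tokens of the previous step.

    attn[t][p] is the attention from reasoning token t to visual patch p.
    """
    n_tokens = len(attn)
    n_patches = len(attn[0])
    return [sum(row[p] for row in attn) / n_tokens for p in range(n_patches)]


def select_hint(attn, k=3):
    """Return (hint_patch, coarse_candidates, reliability)."""
    scores = mean_attention(attn)
    # Coarse step (assumed form): keep the k patches with the highest mean attention.
    coarse = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)[:k]
    # Refinement: the single strongest candidate is taken as the atomic hint.
    hint = coarse[0]
    # Reliability (assumed form): share of tokens whose own top patch is the hint.
    votes = sum(
        1 for row in attn
        if max(range(len(row)), key=lambda p: row[p]) == hint
    )
    return hint, coarse, votes / len(attn)


def patch_center(patch, grid_w, patch_px):
    """Convert a row-major patch index to the pixel coordinates of its center."""
    row, col = divmod(patch, grid_w)
    return (col * patch_px + patch_px // 2, row * patch_px + patch_px // 2)


def maybe_inject(hint_xy, reliability, p_inject, rng):
    """Bernoulli trial: emit the hint text with probability p_inject, else nothing."""
    if rng.random() < p_inject:
        # Hypothetical hint format; the real prompt wording is not given in the source.
        return f"<hint x={hint_xy[0]} y={hint_xy[1]} reliability={reliability:.2f}>"
    return ""


# Toy example: 3 reasoning tokens attending to a 2x2 grid of 14-pixel patches.
attn = [
    [0.1, 0.6, 0.3, 0.0],
    [0.2, 0.5, 0.2, 0.1],
    [0.4, 0.3, 0.2, 0.1],
]
hint, coarse, reliability = select_hint(attn, k=2)
xy = patch_center(hint, grid_w=2, patch_px=14)
text = maybe_inject(xy, reliability, p_inject=0.5, rng=random.Random(0))
```

In this toy input, two of the three tokens attend most strongly to the selected patch, so the assumed reliability score is 2/3. A low score could be used to leave more room for self-reflection and a high score to cut it short, which is the adaptive behavior the abstract describes.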
Similar Papers
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
CV and Pattern Recognition
Makes videos show cause and effect better.
Rethinking Chain-of-Thought Reasoning for Videos
CV and Pattern Recognition
Makes AI understand videos faster with less data.
Video Finetuning Improves Reasoning Between Frames
CV and Pattern Recognition
Helps computers understand video stories better.