Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy
By: Shujian Gao, Yuan Wang, Jiangtao Yan and more
Potential Business Impact:
Helps AI see and understand pictures, not just words.
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced reasoning capabilities in Large Language Models. However, adapting RLVR to multimodal domains suffers from a critical perception-reasoning decoupling. Existing paradigms are driven by text-centric outcome rewards and reason in the language medium, which inadvertently encourages models to bypass visual perception. We empirically validate this through blind experiments: state-of-the-art policies maintain or, surprisingly, even improve performance when visual inputs are entirely removed. This reveals that these models degenerate into blind reasoners, exploiting linguistic priors to generate plausible answers instead of attending to visual evidence. In response, we propose Thinking with Deltas, a framework driven by a Differential Visual Reasoning Policy (DVRP). DVRP introduces intrinsic supervision via visual triplets comprising original, masked, and perturbed inputs. It optimizes the model to maximize reasoning divergence from masked inputs (enforcing visual sensitivity) while minimizing divergence from perturbed inputs (ensuring visual robustness). By aligning reasoning variations strictly with the Delta of visual information, DVRP inherently bolsters visual understanding capabilities and significantly outperforms state-of-the-art methods on both general and medical benchmarks, without requiring external annotations or auxiliary tools.
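The triplet idea in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's implementation. It assumes that "reasoning divergence" can be approximated by a KL divergence between answer distributions, and that the sensitivity and robustness terms are combined linearly. The function names, the weights alpha and beta, and the example probabilities are all hypothetical, since the abstract does not specify any of them.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def differential_visual_term(p_original, p_masked, p_perturbed,
                             alpha=1.0, beta=1.0):
    """Hypothetical score: high when the policy reacts to masking
    (visual sensitivity) and stays stable under perturbation
    (visual robustness). alpha and beta are assumed weights."""
    sensitivity = kl_divergence(p_original, p_masked)        # to be maximized
    robustness_gap = kl_divergence(p_original, p_perturbed)  # to be minimized
    return alpha * sensitivity - beta * robustness_gap


# Made-up answer distributions over three candidate answers.
p_original = [0.70, 0.20, 0.10]   # full image
p_masked = [0.34, 0.33, 0.33]     # image masked: the answer should change
p_perturbed = [0.68, 0.21, 0.11]  # image mildly perturbed: the answer should hold

score = differential_visual_term(p_original, p_masked, p_perturbed)

# A "blind reasoner" gives the same distribution with or without the image,
# so its sensitivity term is zero and it earns no positive score.
blind_score = differential_visual_term(p_original, p_original, p_perturbed)
```

Under these assumptions, a policy that ignores the image scores at most zero, while a policy whose answers shift when the image is masked and hold steady under perturbation scores positively. That is the behavior the abstract describes as aligning reasoning variations with the Delta of visual information.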
Similar Papers
From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning
CV and Pattern Recognition
Helps AI see and think better to solve puzzles.
More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models
CV and Pattern Recognition
Makes AI better at seeing and thinking.
From Illusion to Intention: Visual Rationale Learning for Vision-Language Reasoning
CV and Pattern Recognition
Teaches computers to "think" with pictures.