Smooth Operator: Smooth Verifiable Reward Activates Spatial Reasoning Ability of Vision-Language Model

Published: January 12, 2026 | arXiv ID: 2601.07695v1

By: Siwen Jiao, Tianxiong Lv, Kangan Qian and more

Vision-Language Models (VLMs) face a critical bottleneck in achieving precise numerical prediction for 3D scene understanding. Traditional reinforcement learning (RL) approaches, primarily based on relative ranking, often suffer from severe reward sparsity and gradient instability, failing to effectively exploit the verifiable signals provided by 3D physical constraints. Notably, in standard GRPO frameworks, relative normalization causes "near-miss" samples (characterized by small but non-zero errors) to suffer from advantage collapse. This leads to a severe data utilization bottleneck where valuable boundary samples are discarded during optimization. To address this, we introduce the Smooth Numerical Reward Activation (SNRA) operator and the Absolute-Preserving GRPO (AP-GRPO) framework. SNRA employs a dynamically parameterized Sigmoid function to transform raw feedback into a dense, continuous reward continuum. Concurrently, AP-GRPO integrates absolute scalar gradients to mitigate the numerical information loss inherent in conventional relative-ranking mechanisms. By leveraging this approach, we constructed Numerical3D-50k, a dataset comprising 50,000 verifiable 3D subtasks. Empirical results indicate that AP-GRPO achieves performance parity with large-scale supervised methods while maintaining higher data efficiency, effectively activating latent 3D reasoning in VLMs without requiring architectural modifications.
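The "advantage collapse" described in the abstract can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the relative-error input, the fixed `tolerance` and `steepness` values (the paper describes the Sigmoid as dynamically parameterized), the weight `alpha`, and all function names are hypothetical and are not the authors' implementation.

```python
import math
from statistics import mean, pstdev


def binary_reward(pred, target, tolerance=0.1):
    """Sparse baseline: 1 if the relative error is within tolerance, else 0."""
    rel_err = abs(pred - target) / max(abs(target), 1e-8)
    return 1.0 if rel_err <= tolerance else 0.0


def smooth_reward(pred, target, tolerance=0.1, steepness=10.0):
    """Sigmoid-shaped reward (assumed form): decays smoothly as error grows."""
    rel_err = abs(pred - target) / max(abs(target), 1e-8)
    # Clamp the exponent so math.exp cannot overflow on very large errors.
    z = min(steepness * (rel_err - tolerance), 700.0)
    return 1.0 / (1.0 + math.exp(z))


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def absolute_preserving_advantages(rewards, alpha=0.5, eps=1e-6):
    """Assumed variant: add a scaled absolute reward to the relative advantage."""
    relative = group_relative_advantages(rewards, eps)
    return [a + alpha * r for a, r in zip(relative, rewards)]


# A group of "near-miss" predictions for a target distance of 2.0 (made-up values).
target = 2.0
preds = [2.3, 2.5, 2.8, 3.0]

sparse = [binary_reward(p, target) for p in preds]
dense = [smooth_reward(p, target) for p in preds]

sparse_adv = group_relative_advantages(sparse)  # every entry is 0.0
dense_adv = group_relative_advantages(dense)    # closer predictions rank higher
```

With the binary reward, every sample in the group scores 0, so the normalized advantages are all zero and the group contributes no learning signal. The smooth reward separates the same samples by how close they are, and the absolute term keeps a nonzero signal even when all rewards in a group are identical.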

Milestones over Outcome: Unlocking Geometric Reasoning with Sub-Goal Verifiable Reward

Machine Learning (CS)

Teaches computers to solve math problems better.

8 Jan 2026

Puzzle Curriculum GRPO for Vision-Centric Reasoning

CV and Pattern Recognition

Teaches computers to reason better without human help.

16 Dec 2025

SVQA-R1: Reinforcing Spatial Reasoning in MLLMs via View-Consistent Reward Optimization

CV and Pattern Recognition

Teaches computers to understand where things are.

2 Jun 2025
