Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards
By: Xinyu Tang, Yuliang Zhan, Zhixun Li, and more
Potential Business Impact:
Makes AI models better at reasoning by rebalancing how they learn from their own correct and incorrect answers.
Large reasoning models (LRMs) are typically trained using reinforcement learning with verifiable rewards (RLVR) to enhance their reasoning abilities. In this paradigm, policies are updated using both positive and negative self-generated rollouts, which correspond to distinct sample polarities. In this paper, we systematically investigate how these sample polarities affect RLVR training dynamics and behaviors. We find that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. We further examine how adjusting the advantage values of positive and negative samples, at both the sample level and the token level, affects RLVR training. Based on these insights, we propose an Adaptive and Asymmetric token-level Advantage shaping method for Policy Optimization (A3PO) that more precisely allocates advantage signals to key tokens across the two polarities. Experiments on five reasoning benchmarks demonstrate the effectiveness of our approach.
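The abstract does not spell out A3PO's shaping rule, so the sketch below is only an illustration of the general idea: compute group-normalized (GRPO-style) sample-level advantages from verifiable rewards, then redistribute each sample's advantage over its tokens with different scaling for positive and negative rollouts. The functions, the "1 - token probability" key-token weight, and the hyperparameters `pos_scale`, `neg_scale`, and `tau` are all hypothetical choices, not the authors' method.

```python
# Illustrative sketch only: asymmetric, token-level advantage shaping on top of
# group-normalized sample advantages. The specific weighting rule and
# hyperparameters are assumptions, not the A3PO algorithm from the paper.
import numpy as np

def group_advantages(rewards):
    """GRPO-style sample-level advantages: normalize rewards within a group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def shape_token_advantages(sample_adv, token_logprobs,
                           pos_scale=0.5, neg_scale=1.5, tau=0.5):
    """Spread one sample's advantage over its tokens, asymmetrically by polarity.

    token_logprobs: per-token log-probs of the rollout under the current policy.
    Tokens the policy was less confident about get a larger share of the signal
    (a hypothetical notion of "key tokens"); positive and negative rollouts use
    separate scales so the two polarities are shaped differently.
    """
    probs = np.exp(np.asarray(token_logprobs, dtype=np.float64))
    key_weight = np.clip(1.0 - probs, 0.0, 1.0)          # uncertain tokens weigh more
    scale = pos_scale if sample_adv >= 0 else neg_scale  # asymmetric by polarity
    return sample_adv * (1.0 + scale * (key_weight - tau))

# Usage: a group of 4 rollouts for one prompt with binary verifiable rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
adv = group_advantages(rewards)
rollout_logprobs = [np.log([0.9, 0.4, 0.7]),       # a positive rollout's tokens
                    np.log([0.8, 0.2, 0.6, 0.5])]  # a negative rollout's tokens
for a, lps in zip(adv[:2], rollout_logprobs):
    print(shape_token_advantages(a, lps))
```

In this sketch, negative rollouts use a larger scale so their uncertain tokens receive a stronger corrective signal, echoing the paper's observation that negative samples drive exploration while positive samples mainly sharpen existing patterns; the true allocation rule in A3PO may differ.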
Similar Papers
Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization
Computation and Language
Helps AI learn to solve harder problems faster.
Dissecting Long Reasoning Models: An Empirical Study
Machine Learning (CS)
Makes AI better at understanding tricky problems.
Agentic Reinforced Policy Optimization
Machine Learning (CS)
Teaches AI to use tools better in conversations.