Pinpointing Crucial Steps: Attribution-based Credit Assignment for Verifiable Reinforcement Learning
By: Junxi Yin, Haisen Luo, Zhenyu Li, and more
Potential Business Impact:
Teaches AI to solve hard math problems better.
While Reinforcement Learning with Verifiable Rewards (RLVR) enhances complex reasoning in LLMs, current methods struggle to balance exploration and exploitation. This leads to critical issues like inaccurate credit assignment for intermediate steps and premature entropy collapse, limiting model performance. To address this, we introduce Attribution-based Contribution to Policy Optimization (ACPO), a phased framework that incorporates a difficulty-aware curriculum. ACPO improves exploration by using trajectory semantic segmentation and an attribution-based representation to dynamically regulate policy entropy, thus mitigating its collapse. Concurrently, it enhances exploitation with a factorized reward system that precisely quantifies the hierarchical contribution of each reasoning step, ensuring accurate credit assignment. Extensive experiments on challenging benchmarks, including AIME, MATH, and AMC, demonstrate that ACPO significantly outperforms existing state-of-the-art approaches.
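The abstract describes two mechanisms, semantic segmentation of a trajectory with attribution-based per-step scoring, and a factorized reward that distributes the verifiable outcome reward across reasoning steps, but does not give their formulas. The Python sketch below is only a minimal illustration of that factorized, step-level credit-assignment idea under assumed definitions: the `Step` structure, the attribution scores, and the proportional split are hypothetical stand-ins, not ACPO's actual method.

```python
# Illustrative sketch only: the segmentation, attribution scores, and the
# proportional reward split are assumptions, not the paper's published method.
from dataclasses import dataclass

@dataclass
class Step:
    text: str           # one semantic segment of the reasoning trajectory
    attribution: float  # assumed per-step contribution score in [0, 1]

def factorized_step_rewards(steps, verifiable_reward):
    """Spread a trajectory-level verifiable reward (e.g. 1.0 if the final
    answer verifies, else 0.0) across steps in proportion to their
    attribution scores, so credit is assigned per step rather than to the
    whole trajectory."""
    total = sum(s.attribution for s in steps) or 1.0
    return [verifiable_reward * s.attribution / total for s in steps]

# Toy usage: a three-step trajectory whose final answer was verified correct.
trajectory = [
    Step("Restate the problem and set up variables.", attribution=0.2),
    Step("Apply the key identity that unlocks the solution.", attribution=0.6),
    Step("Substitute values and report the final answer.", attribution=0.2),
]
print(factorized_step_rewards(trajectory, verifiable_reward=1.0))
# -> [0.2, 0.6, 0.2]: the pivotal middle step receives most of the credit.
```

The design point this illustrates is that a single pass/fail verifier signal is too coarse for long reasoning chains; weighting it by per-step attribution gives intermediate steps differentiated credit, which is the failure mode of trajectory-level rewards that the abstract highlights.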
Similar Papers
CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment
Machine Learning (CS)
Boosts AI thinking with step-by-step feedback.
ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning
Artificial Intelligence
Makes AI think better and avoid mistakes.