GDEPO: Group Dual-dynamic and Equal-right-advantage Policy Optimization with Enhanced Training Data Utilization for Sample-Constrained Reinforcement Learning
By: Zhengqing Yan, Xinyang Liu, Yi Zhang, and more
Potential Business Impact:
Teaches computers to prove math problems faster.
Automated Theorem Proving (ATP) is a fundamental challenge in Artificial Intelligence (AI), requiring the construction of machine-verifiable proofs in formal languages such as Lean and serving as a benchmark for AI reasoning capabilities. Reinforcement learning (RL), particularly the high-performance Group Relative Policy Optimization (GRPO) algorithm, has emerged as a mainstream approach for this task. However, in ATP settings GRPO faces two critical issues: when composite rewards are used, its relative advantage estimation may conflict with the binary feedback from the formal verifier; meanwhile, its static sampling strategy may discard entire batches of data if no valid proof is found, contributing nothing to model updates and wasting significant data. To address these limitations, we propose Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO), a method incorporating three core mechanisms: 1) dynamic additional sampling, which resamples invalid batches until a valid proof is discovered; 2) equal-right advantage, which decouples the sign of the advantage function (based on correctness) from its magnitude (modulated by auxiliary rewards) to ensure stable and correct policy updates; and 3) dynamic additional iterations, which applies extra gradient steps to initially failed but eventually successful samples to accelerate learning on challenging cases. Experiments conducted on three datasets of varying difficulty (MiniF2F-test, MathOlympiadBench, PutnamBench) confirm the effectiveness of GDEPO, while ablation studies validate the necessity of its synergistic components. The proposed method enhances data utilization and optimization efficiency, offering a novel training paradigm for ATP.
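To make the first two mechanisms concrete, here is a minimal Python sketch based only on the abstract's description, not the authors' implementation. The helper names `policy_sample`, `verify_proof`, the min-max normalization of auxiliary rewards, and the sampling budget are all illustrative assumptions.

```python
from typing import Callable, List, Tuple

def equal_right_advantage(verified: List[bool], aux: List[float],
                          eps: float = 1e-6) -> List[float]:
    """Equal-right advantage (sketch): the sign of each advantage comes from
    verifier correctness; the magnitude is modulated by an auxiliary reward
    normalized within the group (normalization scheme is an assumption)."""
    lo, hi = min(aux), max(aux)
    span = max(hi - lo, eps)
    scaled = [(a - lo) / span for a in aux]  # map auxiliary rewards to [0, 1]
    # Correct samples always get a positive advantage, incorrect ones a
    # negative one; auxiliary rewards rescale the magnitude but never flip
    # the sign, so they cannot contradict the verifier's binary feedback.
    return [(1.0 if ok else -1.0) * (0.5 + 0.5 * s)
            for ok, s in zip(verified, scaled)]

def sample_group_with_resampling(prompt: str,
                                 policy_sample: Callable[[str, int], List[str]],
                                 verify_proof: Callable[[str], bool],
                                 group_size: int = 8,
                                 max_resamples: int = 4
                                 ) -> Tuple[List[str], List[bool]]:
    """Dynamic additional sampling (sketch): keep drawing fresh groups for a
    prompt until at least one candidate proof passes the verifier (or a
    budget is exhausted), so all-negative batches are not simply discarded."""
    candidates: List[str] = []
    verified: List[bool] = []
    for _ in range(max_resamples):
        candidates = policy_sample(prompt, group_size)
        verified = [verify_proof(c) for c in candidates]
        if any(verified):
            break
    return candidates, verified
```

Under these assumptions, the advantages returned by `equal_right_advantage` would replace GRPO's group-normalized advantages in the policy-gradient update, and the resampling loop would sit in the data-collection stage; the third mechanism (dynamic additional iterations) would then apply extra gradient steps to groups that only succeeded after resampling.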
Similar Papers
Comparative Analysis and Parametric Tuning of PPO, GRPO, and DAPO for LLM Reasoning Enhancement
Artificial Intelligence
Teaches computers to think better and solve problems.
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Computation and Language
Makes AI learn many things at once better.
EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity
Artificial Intelligence
Teaches computers to think better step-by-step.