GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
By: Hongze Tan, Jianfei Pan
Potential Business Impact:
Makes AI better at thinking step-by-step.
Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is limited by coarse-grained credit assignment that applies a uniform reward to all tokens in a sequence. This is a major flaw in long-chain reasoning tasks. This paper addresses the problem with \textbf{Dynamic Entropy Weighting}. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling, which lets us create finer-grained reward signals for precise policy updates in two ways: 1) \textbf{Group Token Policy Optimization} (\textbf{GTPO}) assigns an entropy-weighted reward to each token for fine-grained credit assignment; 2) \textbf{Sequence-Level Group Relative Policy Optimization} (\textbf{GRPO-S}) assigns an entropy-weighted reward to each sequence based on its average token entropy. Experiments show our methods significantly outperform the strong DAPO baseline, and the results confirm that the entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models.
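To make the two reward-shaping schemes concrete, here is a minimal sketch in PyTorch. It is not the authors' implementation: the function names, the entropy normalization, and the use of a group-mean entropy for rescaling are illustrative assumptions; only the general idea (token-level entropy-weighted rewards for GTPO, sequence-level rewards scaled by average token entropy for GRPO-S) comes from the abstract.

```python
# Illustrative sketch of entropy-weighted reward shaping (GTPO / GRPO-S style).
# Assumptions (not from the paper): token entropies come from the policy's
# logits, each response has one scalar reward, and weighting is a simple
# normalization of entropy within the response / group.
import torch
import torch.nn.functional as F


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token policy entropy H_t = -sum_v p(v) log p(v); logits: [T, V]."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)  # shape [T]


def gtpo_token_rewards(logits: torch.Tensor, seq_reward: float) -> torch.Tensor:
    """GTPO-style sketch: spread the sequence reward over tokens, giving
    higher-entropy tokens a larger share (hypothetical normalization)."""
    ent = token_entropy(logits)                 # [T]
    weights = ent / (ent.sum() + 1e-8)          # weights sum to 1 over the response
    # Scale by T so the total matches assigning seq_reward uniformly to each token.
    return seq_reward * weights * ent.numel()


def grpo_s_sequence_reward(logits: torch.Tensor, seq_reward: float,
                           group_mean_entropy: float) -> float:
    """GRPO-S-style sketch: rescale the sequence reward by the response's
    average token entropy relative to the group's mean entropy."""
    avg_ent = token_entropy(logits).mean().item()
    return seq_reward * (avg_ent / (group_mean_entropy + 1e-8))
```

Under these assumptions, a correct response whose reasoning passes through high-entropy (exploratory) tokens receives a larger shaped reward than one with uniformly low entropy, which is the mechanism the paper credits for the gains over the DAPO baseline.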
Similar Papers
Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood
Computation and Language
Teaches computers to solve math problems better.
GTPO: Trajectory-Based Policy Optimization in Large Language Models
Machine Learning (CS)
Makes AI smarter by fixing its mistakes.