Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood
By: Xingyu Lin, Yilin Wen, En Wang, and more
Potential Business Impact:
Teaches computers to solve math problems better.
Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly by boosting their mathematical performance. However, GRPO and related entropy-regularization methods still face challenges rooted in the sparse token rewards inherent to chain-of-thought (CoT). Current approaches often rely on undifferentiated token-level entropy adjustments, which frequently lead to entropy collapse or model collapse. In this work, we propose TEPO, a novel token-level framework that uses Markov Likelihood (sequence likelihood) to link group-level rewards with tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics (including pass@k and accuracy). It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.
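The abstract describes TEPO only at a high level, so the following is a minimal, hypothetical Python sketch of the general idea: compute a GRPO-style group-relative advantage per sampled completion, then spread that group-level signal across tokens using a weighting derived from the sequence (Markov) likelihood. The function names (`group_relative_advantages`, `tepo_token_weights`, `tepo_policy_gradient_terms`) and the specific softmax-style weighting are assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np

def group_relative_advantages(rewards):
    # GRPO-style group baseline: normalize each completion's reward by the
    # mean and std of the group of rollouts sampled for the same prompt.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def tepo_token_weights(token_logprobs):
    # Hypothetical Markov-likelihood weighting (assumption, not the paper's
    # formula): treat the sequence likelihood as a product of per-token
    # conditionals and weight each token by its normalized contribution.
    lp = np.asarray(token_logprobs, dtype=np.float64)
    w = np.exp(lp - lp.max())          # shift for numerical stability
    return w / w.sum()

def tepo_policy_gradient_terms(group_rollouts):
    # group_rollouts: list of (token_logprobs, scalar_reward) for one prompt.
    # Returns one scalar objective term per rollout; a trainer would maximize
    # their sum by gradient ascent on the policy's log-probabilities.
    rewards = [r for _, r in group_rollouts]
    advantages = group_relative_advantages(rewards)
    terms = []
    for (lp, _), adv in zip(group_rollouts, advantages):
        w = tepo_token_weights(lp)
        # Token-level aggregation: the group-level advantage is distributed
        # over tokens according to the likelihood-based weights.
        terms.append(adv * float(np.dot(w, lp)))
    return terms

if __name__ == "__main__":
    # Two sampled chains of thought for the same prompt: one correct
    # (reward 1.0), one incorrect (reward 0.0), with made-up per-token
    # log-probabilities.
    rollouts = [
        ([-0.2, -1.3, -0.7, -0.1], 1.0),
        ([-0.9, -0.4, -2.1], 0.0),
    ]
    print(tepo_policy_gradient_terms(rollouts))
```

The design intent captured here is that, unlike uniform token-level entropy adjustments, each token's share of the group-level reward depends on its role in the sequence likelihood; the exact aggregation used by TEPO should be taken from the paper itself.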
Similar Papers
GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
Computation and Language
Makes AI better at thinking step-by-step.
GTPO: Trajectory-Based Policy Optimization in Large Language Models
Machine Learning (CS)
Makes AI smarter by fixing its mistakes.