Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning in LLMs
By: Han Zhang, Ruibin Zheng, Zexuan Yi, and more
Potential Business Impact:
Makes AI learn better even with slow internet.
As single-center computing approaches power constraints, decentralized training is becoming essential. Reinforcement Learning (RL) post-training enhances Large Language Models (LLMs) but faces challenges in heterogeneous distributed environments due to its tightly coupled sampling-learning alternation. We propose HeteroRL, an asynchronous RL architecture that decouples rollout sampling from parameter learning, enabling robust deployment across geographically distributed nodes under network delays. We identify latency-induced KL divergence between the sampling and learning policies as the key failure mode: it inflates the variance of importance sampling weights until the estimator breaks down. To address this, we propose Group Expectation Policy Optimization (GEPO), which reduces importance-weight variance via group expectation weighting. Theoretically, GEPO achieves exponential variance reduction. Experimentally, it remains more stable than methods such as GRPO, with less than 3% performance degradation under delays of up to 1800 seconds, demonstrating strong potential for decentralized RL over heterogeneous networks.
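The variance problem the abstract describes can be seen in a few lines of code. The sketch below is illustrative only and assumes, for simplicity, that the group-level weight is the mean of per-sample importance ratios within a rollout group; the paper's exact GEPO estimator may differ. The simulated `delay_drift` parameter stands in for the policy staleness caused by network latency.

```python
# Illustrative sketch: per-sample importance weights vs. a group-expectation
# weight under policy staleness. The group weight here is simply the mean of
# per-sample ratios within a rollout group (an assumption for illustration,
# not the paper's exact formulation).
import numpy as np

rng = np.random.default_rng(0)

def importance_ratios(logp_new, logp_old):
    """Per-sample importance weights pi_new(a|s) / pi_old(a|s)."""
    return np.exp(logp_new - logp_old)

# Simulate stale rollouts: the sampler's log-probs drift away from the
# learner's log-probs as network delay (and hence KL divergence) grows.
group_size = 16      # rollouts sampled per prompt (one group)
num_groups = 1000
delay_drift = 0.5    # std of the log-prob gap induced by staleness

logp_old = rng.normal(-1.0, 0.3, size=(num_groups, group_size))
logp_new = logp_old + rng.normal(0.0, delay_drift, size=(num_groups, group_size))

w_per_sample = importance_ratios(logp_new, logp_old)       # high variance
w_group = w_per_sample.mean(axis=1, keepdims=True)          # group expectation
w_group = np.broadcast_to(w_group, w_per_sample.shape)

print(f"per-sample weight variance:      {w_per_sample.var():.3f}")
print(f"group-expectation weight variance: {w_group.var():.3f}")
```

Running the sketch shows the group-expectation weights clustering much more tightly around 1 than the per-sample ratios, which is the stabilizing effect the abstract attributes to GEPO under large delays.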
Similar Papers
GEPO: Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning
Machine Learning (CS)
Trains smart computer programs far apart.
Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems
Machine Learning (CS)
Teaches AI groups to work better, faster.
ESPO: Entropy Importance Sampling Policy Optimization
Machine Learning (CS)
Makes AI better at solving math problems.