Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning
By: Haidong Huang, Haiyue Zhu, Jiayu Song, Xixin Zhao, and more
Potential Business Impact:
Teaches robots new skills safely and quickly.
Offline-to-online reinforcement learning (O2O-RL) has emerged as a promising paradigm for safe and efficient robotic policy deployment but suffers from two fundamental challenges: limited coverage of multimodal behaviors and distributional shifts during online adaptation. We propose UEPO, a unified generative framework inspired by large language model pretraining and fine-tuning strategies. Our contributions are threefold: (1) a multi-seed dynamics-aware diffusion policy that efficiently captures diverse modalities without training multiple models; (2) a dynamic divergence regularization mechanism that enforces physically meaningful policy diversity; and (3) a diffusion-based data augmentation module that enhances dynamics model generalization. On the D4RL benchmark, UEPO achieves +5.9% absolute improvement over Uni-O4 on locomotion tasks and +12.4% on dexterous manipulation, demonstrating strong generalization and scalability.
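To make the first two contributions more concrete, here is a minimal sketch (not the authors' code) of what sampling a single diffusion policy under multiple noise seeds and adding a divergence penalty could look like. The toy denoiser, the number of denoising steps, and the pairwise-distance "divergence" with a hinge margin are all assumptions made for illustration; UEPO's actual architecture and regularizer may differ.

```python
# Hypothetical sketch: multi-seed sampling from one diffusion policy plus a
# divergence penalty that discourages the seed-wise samples from collapsing
# onto a single mode. All names and hyperparameters below are illustrative.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 17, 6  # e.g. a D4RL locomotion task (assumed sizes)

class TinyDiffusionPolicy(nn.Module):
    """Toy denoiser: predicts the noise to remove from a noisy action."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, noisy_action, t):
        # t is the normalised diffusion timestep, appended as a scalar feature.
        x = torch.cat([state, noisy_action, t], dim=-1)
        return self.net(x)

def sample_actions(policy, state, num_seeds=4, steps=10):
    """Draw one action per seed by running a crude reverse-diffusion loop."""
    actions = []
    for seed in range(num_seeds):
        g = torch.Generator().manual_seed(seed)
        a = torch.randn(state.shape[0], ACTION_DIM, generator=g)
        for k in reversed(range(steps)):
            t = torch.full((state.shape[0], 1), k / steps)
            a = a - policy(state, a, t) / steps  # simplistic denoising update
        actions.append(a)
    return torch.stack(actions)  # (num_seeds, batch, action_dim)

def divergence_penalty(actions, margin=0.5):
    """Hinge penalty that fires when seed-wise samples collapse together."""
    k = actions.shape[0]
    dists = [
        (actions[i] - actions[j]).norm(dim=-1).mean()
        for i in range(k) for j in range(i + 1, k)
    ]
    mean_dist = torch.stack(dists).mean()
    return torch.relu(margin - mean_dist)  # zero once modes are far enough apart

policy = TinyDiffusionPolicy(STATE_DIM, ACTION_DIM)
state = torch.randn(8, STATE_DIM)
acts = sample_actions(policy, state)
print("divergence penalty:", divergence_penalty(acts).item())
```

In practice such a penalty would be added to the policy training loss, so a single model is pushed to cover multiple behavior modes rather than training one model per mode; the specific divergence measure used by UEPO is not specified here.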
Similar Papers
EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget
Computation and Language
Helps AI learn new things by forgetting and trying again.
URPO: A Unified Reward & Policy Optimization Framework for Large Language Models
CV and Pattern Recognition
Makes AI smarter by learning and judging at once.
Evolutionary Policy Optimization
Machine Learning (CS)
Teaches robots to learn faster and better.