Scalable Multi-Objective and Meta Reinforcement Learning via Gradient Estimation
By: Zhenshuo Zhang, Minxuan Duan, Youran Ye, and more
Potential Business Impact:
Groups similar robot tasks for faster learning.
We study the problem of efficiently estimating policies that simultaneously optimize multiple objectives in reinforcement learning (RL). Given $n$ objectives (or tasks), we seek the optimal partition of these objectives into $k \ll n$ groups, where each group comprises related objectives that can be trained together. This problem arises in applications such as robotics, control, and preference optimization in language models, where learning a single policy for all $n$ objectives becomes suboptimal as $n$ grows. We introduce a two-stage procedure -- meta-training followed by fine-tuning -- to address this problem. We first learn a meta-policy for all objectives using multitask learning. Then, we adapt the meta-policy to multiple randomly sampled subsets of objectives. The adaptation step leverages a first-order approximation property of well-trained policy networks, which is empirically verified to be accurate within a $2\%$ error margin across various RL environments. The resulting algorithm, PolicyGradEx, efficiently estimates an aggregate task-affinity score matrix given a policy evaluation algorithm. Based on the estimated affinity score matrix, we cluster the $n$ objectives into $k$ groups by maximizing the intra-cluster affinity scores. Experiments on three robotic control benchmarks and the Meta-World benchmark demonstrate that our approach outperforms state-of-the-art baselines by $16\%$ on average, while delivering up to a $26\times$ speedup relative to performing full training to obtain the clusters. Ablation studies validate each component of our approach. For instance, compared with random grouping and gradient-similarity-based grouping, our loss-based clustering yields an improvement of $19\%$. Finally, we analyze the generalization error of policy networks by measuring the Hessian trace of the loss surface, which gives non-vacuous measures relative to the observed generalization errors.
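To make the estimate-then-cluster idea concrete, here is a minimal sketch, not the paper's PolicyGradEx implementation, of how a first-order Taylor approximation can turn per-objective policy gradients at a meta-policy into a task-affinity matrix, which is then grouped greedily to maximize intra-cluster affinity. The losses and gradients are synthetic, and names such as `estimate_affinity`, `greedy_cluster`, `eta`, and `num_subsets` are illustrative assumptions rather than the authors' API.

```python
# Sketch (assumed, not the authors' code): first-order affinity estimation + greedy clustering.
import numpy as np

rng = np.random.default_rng(0)

def estimate_affinity(losses, grads, eta=0.1, num_subsets=50, subset_size=3):
    """losses: (n,) per-objective loss at the meta-policy.
    grads: (n, d) per-objective policy gradients at the meta-policy.
    Returns an (n, n) matrix A where A[i, j] averages the first-order
    predicted improvement of objective i when j's subset is trained on."""
    n, _ = grads.shape
    affinity = np.zeros((n, n))
    counts = np.zeros((n, n))
    for _ in range(num_subsets):
        subset = rng.choice(n, size=subset_size, replace=False)
        update = eta * grads[subset].sum(axis=0)   # one gradient step on the sampled subset
        predicted = losses - grads @ update        # L_i(theta - update) ~= L_i(theta) - g_i . update
        gain = losses - predicted                  # predicted per-objective improvement
        for j in subset:
            affinity[:, j] += gain
            counts[:, j] += 1
    return affinity / np.maximum(counts, 1)

def greedy_cluster(affinity, k):
    """Assign each objective to the cluster whose current members it benefits most."""
    n = affinity.shape[0]
    order = rng.permutation(n)
    clusters = [[int(order[i])] for i in range(k)]  # seed one objective per cluster
    for i in map(int, order[k:]):
        scores = [np.mean([affinity[i, j] + affinity[j, i] for j in c]) for c in clusters]
        clusters[int(np.argmax(scores))].append(i)
    return clusters

# Toy example with synthetic losses and gradients for n = 8 objectives, k = 3 groups.
n, d, k = 8, 16, 3
losses = rng.uniform(0.5, 1.5, size=n)
grads = rng.normal(size=(n, d))
print(greedy_cluster(estimate_affinity(losses, grads), k))
```

In this sketch, the expensive step of fine-tuning the meta-policy on every candidate subset is replaced by the first-order estimate `losses - grads @ update`, which is the approximation the abstract reports as accurate to within roughly $2\%$; the greedy grouping stands in for whatever clustering procedure is used to maximize intra-cluster affinity.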
Similar Papers
Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting
Machine Learning (CS)
Teaches AI to balance many goals at once.
Scalable Multi-Objective Robot Reinforcement Learning through Gradient Conflict Resolution
Robotics
Robots learn to do many jobs better.
Model-Agnostic Meta-Policy Optimization via Zeroth-Order Estimation: A Linear Quadratic Regulator Perspective
Systems and Control
Teaches robots to learn new tasks faster.