
Maestro: Learning to Collaborate via Conditional Listwise Policy Optimization for Multi-Agent LLMs

Published: November 8, 2025 | arXiv ID: 2511.06134v1

By: Wei Yang, Jiacheng Pang, Shixuan Li and more

Potential Business Impact:

Helps AI teams solve harder problems better.

Business Areas:

Machine Learning, Artificial Intelligence, Data and Analytics, Software

Multi-agent systems (MAS) built on Large Language Models (LLMs) are being used to approach complex problems and can surpass single-model inference. However, their success hinges on navigating a fundamental cognitive tension: the need to balance broad, divergent exploration of the solution space with a principled, convergent synthesis to the optimal solution. Existing paradigms often struggle to manage this duality, leading to premature consensus, error propagation, and a critical credit assignment problem that fails to distinguish between genuine reasoning and superficially plausible arguments. To resolve this core challenge, we propose the Multi-Agent Exploration-Synthesis framework Through Role Orchestration (Maestro), a principled paradigm for collaboration that structurally decouples these cognitive modes. Maestro uses a collective of parallel Execution Agents for diverse exploration and a specialized Central Agent for convergent, evaluative synthesis. To operationalize this critical synthesis phase, we introduce Conditional Listwise Policy Optimization (CLPO), a reinforcement learning objective that disentangles signals for strategic decisions and tactical rationales. By combining decision-focused policy gradients with a listwise ranking loss over justifications, CLPO achieves clean credit assignment and stronger comparative supervision. Experiments on mathematical reasoning and general problem-solving benchmarks demonstrate that Maestro, coupled with CLPO, consistently outperforms existing state-of-the-art multi-agent approaches, delivering absolute accuracy gains of 6% on average and up to 10% at best.
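
The abstract describes CLPO only at a high level, so the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a REINFORCE-style policy-gradient term for the Central Agent's decision and a ListNet-style softmax cross-entropy as the listwise ranking loss over justifications. The function names, the `rank_weight` coefficient, the `temperature` parameter, and all numeric inputs are hypothetical and chosen for illustration.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def decision_pg_loss(decision_logits: Sequence[float], chosen: int,
                     reward: float, baseline: float = 0.0) -> float:
    """Assumed decision term: -(reward - baseline) * log pi(chosen)."""
    return -(reward - baseline) * log_softmax(decision_logits)[chosen]


def listwise_rank_loss(rationale_scores: Sequence[float],
                       rationale_rewards: Sequence[float],
                       temperature: float = 1.0) -> float:
    """Assumed ranking term (ListNet style): cross-entropy between
    softmax(rewards / temperature) and softmax(model scores)."""
    target_logp = log_softmax([r / temperature for r in rationale_rewards])
    model_logp = log_softmax(rationale_scores)
    return -sum(math.exp(t) * q for t, q in zip(target_logp, model_logp))


def clpo_style_loss(decision_logits: Sequence[float], chosen: int, reward: float,
                    rationale_scores: Sequence[float],
                    rationale_rewards: Sequence[float],
                    rank_weight: float = 0.5) -> float:
    """Hypothetical combination: decision term plus weighted ranking term."""
    decision = decision_pg_loss(decision_logits, chosen, reward)
    ranking = listwise_rank_loss(rationale_scores, rationale_rewards)
    return decision + rank_weight * ranking


if __name__ == "__main__":
    # Two candidate decisions, two candidate justifications (illustrative values).
    loss = clpo_style_loss([0.0, 0.0], chosen=0, reward=1.0,
                           rationale_scores=[2.0, 0.0],
                           rationale_rewards=[1.0, 0.0])
    print(f"combined loss: {loss:.4f}")
```

Keeping the two terms in separate functions mirrors the separation the abstract draws between signals for decisions and signals for rationales: the decision term depends only on the chosen option and its reward, while the ranking term compares all justifications in the list against each other.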

MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization

Machine Learning (CS)

Teaches AI to control traffic better using smart lessons.

24 Nov 2025

89%

Reinforcement Learning-Augmented LLM Agents for Collaborative Decision Making and Performance Optimization

Artificial Intelligence

Helps AI teams work together to finish tasks faster.

31 Dec 2025

89%

MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization

Machine Learning (CS)

Helps AI balance different goals when writing.

12 Jan 2026


Country of Origin

🇺🇸 United States

Page Count

20 pages
