ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning
By: Ziyu Wan, Yunxiang Li, Xiaoyu Wen and more
Potential Business Impact:
Teaches computers to think about their thinking.
Recent research on reasoning in Large Language Models (LLMs) has sought to enhance their performance by integrating meta-thinking, which enables models to monitor, evaluate, and control their reasoning processes for more adaptive and effective problem-solving. However, current single-agent work lacks a specialized design for acquiring meta-thinking, resulting in low efficacy. To address this challenge, we introduce Reinforced Meta-thinking Agents (ReMA), a novel framework that leverages Multi-Agent Reinforcement Learning (MARL) to elicit meta-thinking behaviors, encouraging LLMs to think about thinking. ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent responsible for detailed execution. Through iterative reinforcement learning with aligned objectives, these agents explore and learn to collaborate, leading to improved generalization and robustness. Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks, including competition-level mathematical benchmarks and LLM-as-a-Judge benchmarks. We also extend ReMA to multi-turn interaction settings, leveraging turn-level ratio and parameter sharing to improve efficiency. Comprehensive ablation studies illustrate the evolving dynamics of each agent, providing insight into how the meta-thinking process enhances the reasoning capabilities of LLMs. Our code is available at https://github.com/ziyuwan/ReMA-public
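The two-agent decomposition described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the names `Policy`, `Rollout`, `rollout`, the toy policies, and the exact-match reward are all hypothetical stand-ins, and real agents would be LLMs updated by reinforcement learning. The sketch only shows the single-turn control flow, in which a high-level agent writes a plan, a low-level agent executes it, and both receive the same scalar reward (one simple way to read "aligned objectives").

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical alias: a policy maps a text prompt to a text response.
Policy = Callable[[str], str]


@dataclass
class Rollout:
    question: str
    plan: str      # produced by the high-level (meta-thinking) agent
    answer: str    # produced by the low-level (reasoning) agent
    reward: float  # one scalar shared by both agents (assumption)


def rollout(question: str, reference: str,
            meta_policy: Policy, reasoning_policy: Policy,
            reward_fn: Callable[[str, str], float]) -> Rollout:
    """Run one single-turn episode: plan first, then execute the plan."""
    plan = meta_policy(question)
    answer = reasoning_policy(f"Question: {question}\nPlan: {plan}")
    reward = reward_fn(answer, reference)
    return Rollout(question, plan, answer, reward)


# Toy stand-ins for the two agents; real ones would be LLM calls.
def toy_meta_policy(question: str) -> str:
    return "Add the two numbers, then report the sum."


def toy_reasoning_policy(prompt: str) -> str:
    # Recover the "a + b" expression from the first prompt line.
    expression = prompt.splitlines()[0].split(": ", 1)[1]
    left, right = expression.split("+")
    return str(int(left) + int(right))


def exact_match_reward(answer: str, reference: str) -> float:
    return 1.0 if answer.strip() == reference.strip() else 0.0


episode = rollout("2 + 3", "5", toy_meta_policy,
                  toy_reasoning_policy, exact_match_reward)
print(episode.plan, episode.answer, episode.reward)
```

In a training loop, the shared `reward` would feed the policy-gradient update of each agent; how the updates alternate or share parameters is left out here because the abstract does not specify it.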
Similar Papers
Meta-Thinking in LLMs via Multi-Agent Reinforcement Learning: A Survey
Artificial Intelligence
Makes AI think about its own thinking better.
Enhancing Multi-Agent Systems via Reinforcement Learning with LLM-based Planner and Graph-based Policy
CV and Pattern Recognition
Helps robots work together on hard jobs.
LAMARL: LLM-Aided Multi-Agent Reinforcement Learning for Cooperative Policy Generation
Robotics
Robots learn tasks faster with AI help.