Score: 3

Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning

Published: May 20, 2025 | arXiv ID: 2505.14069v2

By: Wenlin Zhang, Xiangyang Li, Kuicai Dong, and more

BigTech Affiliations: Huawei

Potential Business Impact:

Trains AI systems to search for, extract, and use external information more effectively while requiring far less training data.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Retrieval-augmented generation (RAG) enhances the text generation capabilities of large language models (LLMs) by integrating external knowledge and up-to-date information. However, traditional RAG systems are limited by static workflows and lack the adaptability required for multi-step reasoning and complex task management. To address these limitations, agentic RAG systems (e.g., DeepResearch) have been proposed, enabling dynamic retrieval strategies, iterative context refinement, and adaptive workflows for handling complex search queries beyond the capabilities of conventional RAG. Recent advances, such as Search-R1, have demonstrated promising gains using outcome-based reinforcement learning, where the correctness of the final answer serves as the reward signal. Nevertheless, such outcome-supervised agentic RAG methods face challenges including low exploration efficiency, gradient conflict, and sparse reward signals. To overcome these challenges, we propose using fine-grained, process-level rewards to improve training stability, reduce computational costs, and enhance efficiency. Specifically, we introduce a novel method, ReasonRAG, that automatically constructs RAG-ProGuide, a high-quality dataset providing process-level rewards for (i) query generation, (ii) evidence extraction, and (iii) answer generation, thereby enhancing the model's inherent capabilities via process-supervised reinforcement learning. With process-level policy optimization, the proposed framework empowers LLMs to autonomously invoke search, generate queries, extract relevant evidence, and produce final answers. Compared to existing approaches such as Search-R1 and traditional RAG systems, ReasonRAG, leveraging RAG-ProGuide, achieves superior performance on five benchmark datasets using only 5k training instances, significantly fewer than the 90k training instances required by Search-R1.
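The core contrast in the abstract is between sparse outcome rewards (one signal on the final answer) and dense process-level rewards attached to each intermediate step of the agentic RAG rollout. The Python sketch below illustrates that difference only; the `Step` class, the per-step `quality` scores, and the reward values are illustrative assumptions, not the paper's actual training code or the RAG-ProGuide data format.

```python
# Minimal sketch (hypothetical, not the paper's implementation) contrasting
# outcome-level and process-level reward assignment over one agentic RAG rollout.
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    kind: str        # "query_generation" | "evidence_extraction" | "answer_generation"
    content: str
    quality: float   # assumed per-step score in [0, 1], e.g. from process-level annotations


def outcome_rewards(trajectory: List[Step], answer_correct: bool) -> List[float]:
    """Outcome supervision: a single sparse signal placed on the final step only."""
    rewards = [0.0] * len(trajectory)
    rewards[-1] = 1.0 if answer_correct else 0.0
    return rewards


def process_rewards(trajectory: List[Step]) -> List[float]:
    """Process supervision: every intermediate step receives its own dense reward."""
    return [step.quality for step in trajectory]


if __name__ == "__main__":
    rollout = [
        Step("query_generation", "reformulated search query", 0.8),
        Step("evidence_extraction", "extracted supporting passage", 0.6),
        Step("answer_generation", "final answer", 1.0),
    ]
    print(outcome_rewards(rollout, answer_correct=True))  # [0.0, 0.0, 1.0] -> sparse
    print(process_rewards(rollout))                       # [0.8, 0.6, 1.0] -> dense
```

The dense per-step signal is what the abstract credits for better training stability and sample efficiency: the policy gets feedback on query generation and evidence extraction even when the final answer is wrong, rather than a single zero reward for the whole trajectory.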

Country of Origin
🇨🇳 🇭🇰 China, Hong Kong

Repos / Data Links

Page Count
22 pages

Category
Computer Science: Information Retrieval