DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning
By: Chenyang Gu, Yewen Pu, Bruce Yang, and more
Potential Business Impact:
Helps AI find answers by searching the internet.
Enhancing LLMs with the ability to actively search external knowledge is crucial for complex, real-world tasks. Current approaches either rely on prompting to elicit the model's innate agent capabilities, or, when RL is applied to complex interactive tasks, suffer from performance ceilings and training collapse, leaving their true agentic potential untapped. To address this, we introduce Dynamic-filter Sequence-level Policy Optimization (DSPO), an improved RL algorithm designed for robust agent training through sequence-level optimization and dynamic sample filtering. We train our model purely through RL to interleave multi-turn search and reasoning, obviating the need for supervised demonstration data. Across multiple QA benchmarks, our DSPO-trained 7B model improves over comparable prior work by 34.1%, and even outperforms the 14B model from prior work on complex multi-hop QA such as HotpotQA by nearly 9% relative, while maintaining exceptional training stability.
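The abstract names two ingredients, sequence-level optimization and dynamic sample filtering, but does not give the objective itself. The sketch below is only an illustration of how those two ideas are commonly combined in GRPO-style trainers: a clipped surrogate loss applied to a sequence-level importance ratio, with rollout groups whose rewards carry no learning signal filtered out before the update. The function name, tensor shapes, group-normalized advantage, and the zero-variance filtering rule are assumptions for illustration, not the paper's formulation.

```python
# Minimal sketch (assumed, not the paper's exact objective): sequence-level
# clipped policy-gradient loss with dynamic filtering of uninformative groups.
import torch

def dspo_style_loss(logp_new, logp_old, rewards, mask,
                    clip_eps=0.2, var_threshold=1e-6):
    """
    logp_new: (G, N, T) per-token log-probs under the current policy
    logp_old: (G, N, T) per-token log-probs under the behaviour policy
    rewards:  (G, N)    scalar reward per sampled sequence
    mask:     (G, N, T) 1 for generated (non-padding) tokens, else 0
    G = prompts (groups), N = rollouts per prompt, T = max sequence length
    """
    # Dynamic filtering (assumed rule): drop groups where every rollout got
    # the same reward, since they contribute no gradient signal.
    keep = rewards.var(dim=1) > var_threshold            # (G,)
    if not keep.any():
        return logp_new.sum() * 0.0                      # nothing informative
    logp_new, logp_old = logp_new[keep], logp_old[keep]
    rewards, mask = rewards[keep], mask[keep]

    # Group-normalised advantage: one scalar per sampled sequence.
    adv = (rewards - rewards.mean(dim=1, keepdim=True)) / (
        rewards.std(dim=1, keepdim=True) + 1e-8)         # (g, N)

    # Sequence-level importance ratio: length-normalised sum of token log-ratios.
    token_log_ratio = (logp_new - logp_old) * mask
    seq_len = mask.sum(dim=-1).clamp(min=1.0)
    ratio = torch.exp(token_log_ratio.sum(dim=-1) / seq_len)  # (g, N)

    # Clipped surrogate objective applied at the sequence level.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```

In a multi-turn search setting, the mask would also exclude retrieved-document tokens so that only model-generated reasoning and query tokens receive gradient; that detail is likewise an assumption here.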
Similar Papers
Soft Adaptive Policy Optimization
Machine Learning (CS)
Teaches AI to learn better and faster.
Group Sequence Policy Optimization
Machine Learning (CS)
Makes AI learn faster and better.
Soft Policy Optimization: Online Off-Policy RL for Sequence Models
Machine Learning (CS)
Teaches computers to learn from more examples faster.