AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards
By: Zihan Lin, Xiaohan Wang, Hexiong Yang, and more
Potential Business Impact:
Teaches AI to reason better with tools.
While reinforcement learning (RL) shows promise in training tool-use large language models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of explicit reasoning rewards to bolster reasoning and tool utilization. Furthermore, naively combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose advantage-weighted policy optimization (AWPO) -- a principled RL framework that effectively integrates explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0 percent in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.
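To make the abstract's description more concrete, the sketch below illustrates one plausible reading of AWPO's advantage combination: group-relative (GRPO-style) advantages for outcome and reasoning rewards, a variance-aware gate and difficulty-aware weight applied to the reasoning term, and a PPO-style clipped surrogate loss. The specific functional forms of the gate, weight, and clipping are assumptions for illustration only; the paper's exact formulas are not given in this summary.

```python
import numpy as np

def awpo_advantages(outcome_rewards, reasoning_rewards, eps=1e-8):
    """Hypothetical combination of outcome and reasoning advantages for one
    group of rollouts, using group-relative statistics. The gating and
    weighting forms below are illustrative assumptions, not the paper's."""
    o = np.asarray(outcome_rewards, dtype=float)
    r = np.asarray(reasoning_rewards, dtype=float)

    # Group-relative (standardized) advantages, GRPO-style.
    adv_outcome = (o - o.mean()) / (o.std() + eps)
    adv_reason = (r - r.mean()) / (r.std() + eps)

    # Variance-aware gating (assumed form): lean less on reasoning signals
    # when outcome rewards already separate the group (high variance).
    gate = 1.0 / (1.0 + o.var())

    # Difficulty-aware weighting (assumed form): harder prompts, i.e. low
    # mean outcome reward in the group, weight the reasoning signal more.
    # Assumes outcome rewards lie in [0, 1].
    weight = np.clip(1.0 - o.mean(), 0.0, 1.0)

    return adv_outcome + gate * weight * adv_reason


def clipped_surrogate_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Standard PPO-style clipped objective as a stand-in; AWPO's tailored
    clipping mechanism may differ in detail."""
    ratio = np.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.minimum(unclipped, clipped).mean()
```

Under this reading, the reasoning reward never overrides the verifiable outcome signal: its contribution is scaled down whenever outcome rewards are already informative, which matches the abstract's stated goal of avoiding conflict with the primary optimization objective.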
Similar Papers
Agentic Reinforced Policy Optimization
Machine Learning (CS)
Teaches AI to use tools better in conversations.
Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning
Artificial Intelligence
Lets computers use calculators for math problems.
Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning
Computation and Language
Teaches computers to think better from the start.