Score: 2

A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning

Published: October 9, 2025 | arXiv ID: 2510.07958v1

By: Fengji Zhang, Xinyao Niu, Chengyang Ying and more

Potential Business Impact:

Helps computers answer questions with many right answers.

Business Areas:

Semantic Search, Internet Services

Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A$^2$Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed $\mathrm{AnsF1}$ reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks demonstrate that A$^2$Search achieves new state-of-the-art performance. With only a single rollout, A$^2$Search-7B yields an average $\mathrm{AnsF1}@1$ score of $48.4\%$ across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B ($46.2\%$). Extensive analyses further show that A$^2$Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at https://github.com/zfj1998/A2Search
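The abstract says the $\mathrm{AnsF1}$ reward accommodates multiple answers but does not define it. As a rough illustration only, the sketch below assumes a set-level F1 between the answers a model returns and a set of accepted reference answers, with exact matching after simple normalization. The function names `normalize` and `answer_set_f1` and the normalization rules are hypothetical and are not taken from the paper or its repository.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = answer.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def answer_set_f1(predicted: list[str], references: list[str]) -> float:
    """Set-level F1 between predicted and reference answers.

    Assumption: an answer counts as correct only if its normalized form
    exactly matches a normalized reference answer. The paper's actual
    AnsF1 definition may differ.
    """
    pred = {normalize(a) for a in predicted} - {""}
    gold = {normalize(a) for a in references} - {""}
    if not pred or not gold:
        return 0.0
    hits = len(pred & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)  # share of returned answers that are valid
    recall = hits / len(gold)     # share of valid answers that were returned
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # One of two valid answers found: precision 1.0, recall 0.5.
    print(answer_set_f1(["Paris"], ["Paris", "Lyon"]))
    # One correct and one spurious answer: precision 0.5, recall 1.0.
    print(answer_set_f1(["The Beatles", "Queen"], ["Beatles"]))
```

Under this reading, a model that returns only one of several valid answers is penalized through recall, and a model that pads its output with wrong answers is penalized through precision, which is how a single scalar reward could encourage complete but not over-generated answer sets.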

DEEPAMBIGQA: Ambiguous Multi-hop Questions for Benchmarking LLM Answer Completeness

Computation and Language

Helps computers answer tricky questions better.

3 Nov 2025 3

89%

Think Less, Label Better: Multi-Stage Domain-Grounded Synthetic Data Generation for Fine-Tuning Large Language Models in Telecommunications

Computation and Language

Makes AI learn hard jobs without people.

30 Sep 2025 0

89%

Beyond the limitation of a single query: Train your LLM for query expansion with Reinforcement Learning

Computation and Language

Helps computers answer harder questions by searching better.

11 Oct 2025 2

Repos / Data Links

github.com

Page Count

47 pages
