BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents
By: Litu Ou, Kuan Li, Huifeng Yin, and more
Potential Business Impact:
Lets AI know when its answers are good.
Confidence in LLMs is a useful indicator of model uncertainty and answer reliability. Existing work has mainly focused on single-turn scenarios, while research on confidence in complex multi-turn interactions is limited. In this paper, we investigate whether LLM-based search agents can communicate their own confidence through verbalized confidence scores after long sequences of actions, a significantly more challenging task than outputting confidence in a single interaction. Experimenting on open-source agentic models, we first find that models exhibit much higher task accuracy when confidence is high and near-zero accuracy when confidence is low. Based on this observation, we propose Test-Time Scaling (TTS) methods that use confidence scores to judge answer quality, prompting the model to retry until it reaches a satisfactory confidence level. Results show that our proposed methods significantly reduce token consumption while delivering performance competitive with fixed-budget TTS baselines.
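The abstract only sketches the confidence-gated retry loop, so the following is a minimal Python illustration of the idea. Everything here is an assumption for clarity rather than a detail from the paper: the `run_agent` callable, the "Confidence: NN" output format, the 0.8 acceptance threshold, and the 5-attempt budget are all hypothetical.

```python
import re

# Hypothetical sketch of confidence-guided test-time scaling.
# Assumed (not from the paper): the agent verbalizes a score like
# "Confidence: 85" on a 0-100 scale at the end of its rollout.

CONF_THRESHOLD = 0.8   # assumed acceptance threshold
MAX_ATTEMPTS = 5       # assumed retry budget

def parse_confidence(agent_output: str) -> float:
    """Extract a verbalized confidence score and normalize it to [0, 1]."""
    match = re.search(r"[Cc]onfidence:\s*(\d+(?:\.\d+)?)", agent_output)
    return float(match.group(1)) / 100 if match else 0.0

def confidence_guided_tts(run_agent, task: str):
    """Rerun the agent until its verbalized confidence is satisfactory."""
    best_answer, best_conf = None, -1.0
    for _ in range(MAX_ATTEMPTS):
        output = run_agent(task)        # one full multi-turn rollout
        conf = parse_confidence(output)
        if conf > best_conf:            # keep the most confident answer so far
            best_answer, best_conf = output, conf
        if conf >= CONF_THRESHOLD:      # accept early, saving tokens
            break
    return best_answer, best_conf
```

One design note on this sketch: because the loop keeps the highest-confidence answer seen so far, hitting the attempt cap degrades gracefully to a best-of-N selection, while an early high-confidence accept is what yields the token savings over a fixed-budget baseline.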
Similar Papers
AgentTTS: Large Language Model Agent for Test-time Compute-optimal Scaling Strategy in Complex Tasks
Artificial Intelligence
Boosts AI for multi-step, complex tasks.
CTTS: Collective Test-Time Scaling
Computation and Language
Helps AI learn better by working together.