BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
By: Jason Wei, Zhiqing Sun, Spencer Papay, and more
Potential Business Impact:
Tests how well computers can find hard-to-find answers on the internet.
We present BrowseComp, a simple yet challenging benchmark for measuring the ability of agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and easily verifiable against reference answers. BrowseComp is to browsing agents what programming competitions are to coding agents: an incomplete but useful benchmark. While BrowseComp sidesteps challenges of a true user-query distribution, such as generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information. BrowseComp can be found at https://github.com/openai/simple-evals.
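To illustrate what "short and easily verifiable" answers make possible, here is a minimal Python sketch of exact-match grading after light normalization. The function names and normalization rules are illustrative assumptions, not the actual grader in the simple-evals repository.

```python
# Minimal sketch of short-answer verification, assuming exact match after
# light normalization. Illustrative only; not the simple-evals implementation.

def normalize(answer: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace for a forgiving comparison."""
    cleaned = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

def is_correct(predicted: str, reference: str) -> bool:
    """Short reference answers make verification a cheap string comparison."""
    return normalize(predicted) == normalize(reference)

if __name__ == "__main__":
    print(is_correct(" Paris, France. ", "paris france"))  # True
```

Because each question has a single short reference answer, grading stays cheap and unambiguous regardless of how long or complex the agent's browsing trajectory was.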
Similar Papers
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Computation and Language
Tests AI's ability to understand web pages with pictures.
BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese
Computation and Language
Tests AI's ability to find information on the Chinese web.
InteractComp: Evaluating Search Agents With Ambiguous Queries
Computation and Language
Tests whether search agents can ask clarifying questions to resolve ambiguous queries.