MMSearch-Plus: A Simple Yet Challenging Benchmark for Multimodal Browsing Agents
By: Xijia Tao, Yihua Teng, Xinxing Su, and more
Potential Business Impact:
Helps computers understand pictures and text together.
Large multimodal language models (MLLMs) are increasingly deployed as web agents, yet many multimodal browsing benchmarks can be solved by shallow, fixed workflows that lean on high-recall image search and nearby text, masking the genuinely multimodal challenges of fine-grained visual reasoning, provenance verification, and long-horizon tool use. We introduce MMSearch-Plus, a benchmark of 311 tasks that strongly demand multimodal understanding while preserving the difficulty profile of strong text-only browsing suites. Each item is constructed to contain multiple weak, localized visual signals that must be extracted, propagated through iterative text-image search, and cross-validated under retrieval noise before answering. Our curation procedure, Spatial-Temporal Extrapolation, seeds questions whose answers require extrapolating from spatial cues (micro-text, part-level appearance, layouts, signage) and temporal traces (broadcast overlays, seasonal context) to out-of-image facts such as events, dates, and venues. We provide a model-agnostic agent framework with browsing tools and evaluate a range of closed and open MLLMs. The strongest agent (o3) attains 15.1% accuracy without search and 36.0% with rollout under our framework, while a strong open-source model (Qwen-2.5-VL-72B-Instruct) achieves 0.0% without search and 6.9% after 20 rounds of search. Beyond answer accuracy, we assess bounding-box production and cropped-image search, and conduct an error analysis that surfaces failures in source verification, part-based reasoning, and long-horizon planning.
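The abstract describes a model-agnostic agent framework in which localized visual cues are cropped out, propagated through iterative text and image search, and cross-validated before the agent answers. The sketch below is a minimal illustration of what such a tool-calling rollout loop could look like; the `rollout`, `Step`, `mllm`, and `tools` names are hypothetical placeholders and are not taken from the paper.

```python
# Minimal sketch of an agentic browsing rollout (assumed interfaces, not the
# paper's actual framework); the model picks one tool per round until it answers
# or the round budget (e.g., 20 rounds, as in the abstract) is exhausted.
from dataclasses import dataclass


@dataclass
class Step:
    action: str          # "text_search", "image_search", "crop", or "answer"
    argument: object     # query string, image bytes, or bounding box
    observation: str     # tool output fed back to the model


def rollout(mllm, task_image, question, tools, max_rounds=20):
    """Iterate tool calls until the model answers or the round budget is spent."""
    history: list[Step] = []
    for _ in range(max_rounds):
        # The model conditions on the original image, the question, and all prior steps.
        action, argument = mllm.decide(task_image, question, history)
        if action == "answer":
            return argument, history                      # final answer string
        if action == "crop":
            # Bounding-box production: isolate a weak, localized cue
            # (signage, a broadcast overlay) for a follow-up cropped-image search.
            observation = tools.crop(task_image, bbox=argument)
        elif action == "image_search":
            observation = tools.image_search(argument)    # reverse image search
        else:
            observation = tools.text_search(argument)     # standard web search
        history.append(Step(action, argument, observation))
    return None, history                                  # budget exhausted, no answer
```

Under this reading, the "without search" setting corresponds to calling the model once with no tool rounds, while the rollout setting allows the full loop above.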
Similar Papers
DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
CV and Pattern Recognition
Lets computers search the web for answers.
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Computation and Language
Tests AI's ability to understand web pages with pictures.
BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
Computation and Language
Tests AI that finds answers online better.