DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
By: Kartik Narayan, Yang Xu, Tian Cao, and more
Potential Business Impact:
Lets computers search the web for answers.
Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to dynamic, ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval-augmented generation (RAG) methods, search agents, and search-equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image, making image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold-start supervised fine-tuning phase followed by online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline and intermixed with real-world information from web search tools. This dataset contains diverse multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use, and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights valuable for advancing multimodal web search.
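The abstract describes the inference-time behavior only at a high level; below is a minimal Python sketch of the kind of on-demand, multi-turn search loop it outlines. The MLLM and tool interfaces (decide, crop, image_search, text_search), the action names, and the turn budget are all illustrative assumptions, not the paper's actual implementation.

```python
from typing import Protocol


class SearchTools(Protocol):
    """Assumed web-search tool interface (names are hypothetical)."""
    def crop(self, image, region: str): ...
    def image_search(self, crop) -> str: ...
    def text_search(self, query: str) -> str: ...


class MLLM(Protocol):
    """Assumed model interface: returns (action, payload) per turn."""
    def decide(self, image, question: str, history: list) -> tuple[str, str]: ...


def answer_query(image, question: str, mllm: MLLM, tools: SearchTools,
                 max_turns: int = 5) -> str:
    """Multi-turn loop: on each turn the MLLM either answers directly
    or issues a search with a query it crafts itself."""
    history: list[tuple[str, str, str]] = []  # (action, payload, observation)
    for _ in range(max_turns):
        # The model reasons over the image, the question, and all
        # previously retrieved evidence, then emits its next action.
        action, payload = mllm.decide(image, question, history)
        if action == "answer":
            return payload
        if action == "image_search":
            # Search with a relevant crop of the input image rather than
            # the full image, which the abstract argues is more effective.
            observation = tools.image_search(tools.crop(image, region=payload))
        else:  # "text_search"
            # Text queries are rewritten each turn in light of what was
            # retrieved so far (self-reflection / self-correction).
            observation = tools.text_search(payload)
        history.append((action, payload, observation))
    # Turn budget exhausted: answer from the evidence gathered so far.
    return mllm.decide(image, question, history)[1]
```

The design point mirrored here is that search is conditional and iterative: the model may answer immediately, call either tool repeatedly with self-corrected queries, or stop early, rather than following a fixed retrieve-then-answer pipeline.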
Similar Papers
MMSearch-Plus: A Simple Yet Challenging Benchmark for Multimodal Browsing Agents
Artificial Intelligence
Helps computers understand pictures and text together.
How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding
CV and Pattern Recognition
Shows how AI understands pictures and words.
Empowering Multimodal LLMs with External Tools: A Comprehensive Survey
CV and Pattern Recognition
Lets AI use tools to solve harder problems.