SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

Published: December 30, 2025 | arXiv ID: 2512.24330v1

By: Yong Xien Chng, Tao Hu, Wenwen Tong, and more

Potential Business Impact:

Lets computers see, search, and understand pictures.

Business Areas:

Visual Search, Internet Services

While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve the training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.
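The interleaving of reasoning and tool calls described in the abstract can be pictured as a simple loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the tool bodies are string-returning stubs, and `run_episode`, `scripted_policy`, the action dictionary format, and the `max_turns` budget are all hypothetical names introduced here. Only the three tool kinds (image search, text search, image crop) come from the abstract.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-ins for the three tools named in the abstract.
# Real tools would query a search backend or crop pixels; these return strings.
def image_search(arg: str) -> str:
    return f"image results for {arg}"

def text_search(arg: str) -> str:
    return f"text results for {arg}"

def image_crop(arg: str) -> str:
    return f"cropped region {arg}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "image_search": image_search,
    "text_search": text_search,
    "image_crop": image_crop,
}

# Assumed action format: {"type": "tool", "tool": ..., "arg": ...}
# or {"type": "answer", "text": ...}.
Action = Dict[str, str]

def run_episode(
    policy: Callable[[List[str]], Action],
    question: str,
    max_turns: int = 4,
) -> Tuple[Optional[str], List[str]]:
    """Alternate policy decisions and tool observations until an answer."""
    context: List[str] = [question]
    for _ in range(max_turns):
        action = policy(context)
        if action["type"] == "answer":
            return action["text"], context
        tool = TOOLS.get(action["tool"])
        if tool is None:
            observation = "error: unknown tool"
        else:
            observation = tool(action["arg"])
        # The observation is appended so the next decision can condition on it.
        context.append(f'{action["tool"]} -> {observation}')
    # Turn budget exhausted without a final answer.
    return None, context

def scripted_policy(context: List[str]) -> Action:
    """Toy policy: crop, then image search, then text search, then answer."""
    plan: List[Action] = [
        {"type": "tool", "tool": "image_crop", "arg": "[120, 40, 360, 200]"},
        {"type": "tool", "tool": "image_search", "arg": "cropped logo"},
        {"type": "tool", "tool": "text_search", "arg": "who designed the logo"},
    ]
    turn = len(context) - 1  # number of observations gathered so far
    if turn < len(plan):
        return plan[turn]
    return {"type": "answer", "text": "demo answer"}

if __name__ == "__main__":
    answer, trace = run_episode(scripted_policy, "Who designed this logo?")
    print(answer)
    for line in trace:
        print(line)
```

In a trained agent the scripted policy would be replaced by the VLM itself, and the RL stage would reward trajectories whose tool calls lead to correct answers. How SenseNova-MARS formats actions or assigns rewards is not specified in the abstract, so none of that is modeled here.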

MARS: Multi-Agent Robotic System with Multimodal Large Language Models for Assistive Intelligence

Robotics

Helps robots help people safely in their homes.

3 Nov 2025

Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

Artificial Intelligence

Teaches computers to "think" with pictures and tools.

24 Nov 2025

MARS: Optimizing Dual-System Deep Research via Multi-Agent Reinforcement Learning

Artificial Intelligence

Helps computers think faster and learn new things.

6 Oct 2025

Page Count

27 pages