R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Published: March 13, 2025 | arXiv ID: 2503.10615v2

By: Yi Yang, Xiaoxuan He, Hongkun Pan and more

Potential Business Impact:

Helps computers understand pictures and words together.

Business Areas:

Image Recognition, Data and Analytics, Software

Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to analyze and reason about visual content effectively, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textual representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset, which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.

Mind with Eyes: from Language Reasoning to Multimodal Reasoning

Computation and Language

Computers understand pictures and words together.

23 Mar 2025

91%

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

CV and Pattern Recognition

Teaches computers to solve math problems better.

9 Mar 2025

91%

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

CV and Pattern Recognition

Helps computers understand and reason about many things.

8 May 2025

Page Count

27 pages
