R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
By: Hengguang Zhou, Xirui Li, Ruochen Wang and more
Potential Business Impact:
Makes AI understand pictures and solve problems.
Recently, DeepSeek R1 demonstrated how reinforcement learning with simple rule-based incentives can enable autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning have often failed to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning on a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding both SFT settings by approximately 2%. In addition, we share our failed attempts and insights from trying to achieve R1-like reasoning using RL with instruct models, aiming to shed light on the challenges involved. Our key observations include: (1) applying RL to instruct models often results in trivial reasoning trajectories, and (2) naive length rewards are ineffective in eliciting reasoning capabilities. The project code is available at https://github.com/turningpoint-ai/VisualThinker-R1-Zero
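The abstract does not spell out the "simple rule-based incentives", so the following is only a minimal illustrative sketch of the kind of reward functions typically used in R1-Zero-style training: an exact-match accuracy reward plus a format reward that checks for a <think>/<answer> response template. The tag format, weighting, and matching rules here are assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of R1-Zero-style rule-based rewards.
# The <think>/<answer> template and equal weighting are assumptions,
# not the exact rules used by VisualThinker-R1-Zero.
import re

def format_reward(response: str) -> float:
    """1.0 if the response wraps reasoning and answer in the expected tags, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the ground-truth label (case-insensitive)."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == ground_truth.strip().lower() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # Simple sum of the two rule-based signals; relative weighting is a free choice.
    return accuracy_reward(response, ground_truth) + format_reward(response)

# Example: a well-formatted, correct response to a spatial-reasoning question.
resp = "<think>The chair is closer to the camera than the table.</think><answer>chair</answer>"
print(total_reward(resp, "chair"))  # -> 2.0
```

In this style of training, such scalar rewards are fed to a policy-gradient method (e.g., GRPO/PPO) over sampled responses; no learned reward model or SFT data is required, which is what makes the "zero" setting possible.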
Similar Papers
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
CV and Pattern Recognition
Teaches computers to solve math problems better.
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
Computation and Language
Teaches computers to think better, not just copy.
SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
Machine Learning (CS)
Teaches computers to think step-by-step better.