
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning

Published: March 23, 2025 | arXiv ID: 2503.18013v1

By: Yufei Zhan, Yousong Zhu, Shurong Zheng and more

Potential Business Impact:

Teaches AI to understand pictures better, faster.

Business Areas:

Image Recognition, Data and Analytics, Software

Large Vision-Language Models (LVLMs) typically follow a two-stage training paradigm: pretraining and supervised fine-tuning. Recently, preference optimization, derived from the language domain, has emerged as an effective post-training reinforcement strategy to enhance the capabilities of LVLMs. However, constructing high-quality human-annotated preference data and developing robust reward models to mimic these preferences are both costly and challenging. Motivated by this observation, we propose Vision-R1, a novel vision-guided R1-like reinforcement learning algorithm for LVLMs that rewards models with definitive vision feedback. It leverages only curated instruction data, eliminating the need for specialized reward models and handcrafted preference datasets. We incorporate a criterion-driven reward function that further integrates multi-dimensional feedback to evaluate model completions comprehensively based on the vision task logic. Furthermore, we introduce a progressive rule refinement strategy that dynamically adjusts the reward criteria during training, enabling continuous model improvement and mitigating reward hacking. Extensive experiments on both in-distribution and out-of-distribution benchmarks demonstrate that fine-tuning 7B LVLMs with Vision-R1 achieves consistent performance gains, with improvements of up to 50%, even surpassing a state-of-the-art model 10x its size.
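
The abstract does not spell out the reward itself, so the following is only an illustrative sketch of what a criterion-driven reward with progressive rule refinement might look like, not the authors' implementation. It assumes an object-localization task where completions are parsed into bounding boxes; the three criteria (format, precision, recall), the IoU matching rule, the linear threshold schedule, and all names (`iou`, `iou_threshold`, `vision_reward`) are hypothetical choices made for this example.

```python
# Illustrative sketch only: criteria, schedule, and names are assumptions,
# not taken from the Vision-R1 paper.

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_threshold(step: int, total_steps: int,
                  start: float = 0.5, end: float = 0.75) -> float:
    """Hypothetical refinement rule: tighten the IoU criterion linearly."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + frac * (end - start)


def vision_reward(pred_boxes: list[Box], gt_boxes: list[Box],
                  well_formed: bool, step: int, total_steps: int) -> float:
    """Average of three assumed criteria: format, precision, recall."""
    if not well_formed:
        return 0.0  # an unparseable completion earns nothing
    thr = iou_threshold(step, total_steps)
    unmatched = list(gt_boxes)
    hits = 0
    # Greedy one-to-one matching of predictions to ground-truth boxes.
    for p in pred_boxes:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)
            hits += 1
    precision = hits / len(pred_boxes) if pred_boxes else 0.0
    recall = hits / len(gt_boxes) if gt_boxes else 0.0
    return (1.0 + precision + recall) / 3.0
```

Under these assumptions, a prediction that overlaps its target with IoU 0.6 is fully rewarded early in training, when the threshold is 0.5, but earns only the format component once the threshold has tightened to 0.75. A rule-based reward of this kind needs only ground-truth annotations from instruction data, which is the property the abstract highlights: no learned reward model and no preference pairs.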

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

CV and Pattern Recognition

Makes computers understand pictures better using games.

10 Apr 2025

92%

Reward Generation via Large Vision-Language Model in Offline Reinforcement Learning

Machine Learning (CS)

Teaches computers to learn from old data alone.

3 Apr 2025

91%

Enhancing Rating-Based Reinforcement Learning to Effectively Leverage Feedback from Large Vision-Language Models

Machine Learning (CS)

AI learns to guide robots better with AI feedback.

15 Jun 2025


Repos / Data Links

github.com

Page Count

14 pages
