VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making
By: Jake Grigsby, Yuke Zhu, Michael Ryoo, and more
Potential Business Impact:
Teaches computers to see and follow instructions.
Recent research seeks to harness the general knowledge and reasoning of large language models (LLMs) to build agents that accomplish user-specified goals in interactive environments. Vision-language models (VLMs) extend LLMs to multi-modal data and give agents the visual reasoning necessary for new applications in areas such as computer automation. However, agent tasks emphasize skills where accessible open-weight VLMs lag behind their LLM equivalents. For example, VLMs are less capable of following an environment's strict output syntax requirements and are instead geared toward open-ended question answering. Overcoming these limitations requires supervised fine-tuning (SFT) on task-specific expert demonstrations. Our work approaches these challenges from an offline-to-online reinforcement learning (RL) perspective. RL lets us fine-tune VLMs to agent tasks while learning from the unsuccessful decisions of our own model or of more capable (larger) models. We explore an off-policy RL solution that retains the stability and simplicity of the widely used SFT workflow while allowing our agent to self-improve and learn from low-quality datasets. We demonstrate this technique with two open-weight VLMs across three multi-modal agent domains.
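To make the off-policy idea concrete, below is a minimal sketch of a token-level Q-learning loss computed on top of a VLM's output logits. This is an illustrative assumption about the objective, not the paper's exact implementation: the names q_logits, behavior_tokens, and q_learning_loss are hypothetical, and a real agent would add pieces the abstract implies but this sketch omits, such as a target network and the offline-to-online data schedule.

import torch
import torch.nn.functional as F

def q_learning_loss(q_logits, behavior_tokens, rewards, dones, gamma=0.99):
    # q_logits: (B, T, V) per-token action values from a Q-head on the VLM.
    # behavior_tokens: (B, T) token ids the (possibly low-quality) dataset policy took.
    # rewards, dones: (B, T) float tensors of per-step rewards and episode-end flags.
    q_taken = q_logits.gather(-1, behavior_tokens.unsqueeze(-1)).squeeze(-1)  # Q(s_t, a_t)
    with torch.no_grad():
        # Bootstrapped target: r_t + gamma * max_a Q(s_{t+1}, a), cut off at episode ends.
        next_q = q_logits[:, 1:].max(dim=-1).values
        target = rewards[:, :-1] + gamma * (1.0 - dones[:, :-1]) * next_q
    # TD regression only on transitions that have a successor step.
    return F.mse_loss(q_taken[:, :-1], target)

One way to read the abstract's claim about stability and simplicity: a per-token loss like this batches and backpropagates exactly like the cross-entropy update used in SFT, so the training pipeline stays familiar while the objective can still learn from unsuccessful trajectories rather than only expert demonstrations.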
Similar Papers
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
CV and Pattern Recognition
Makes computers understand pictures better using games.
Teaching RL Agents to Act Better: VLM as Action Advisor for Online Reinforcement Learning
Machine Learning (CS)
Teaches robots new skills faster with smart advice.
GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks
Computation and Language
Teaches computers to solve problems in many ways.