Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models
By: Mingyang Lyu, Yinqian Sun, Erliang Lin, and more
Potential Business Impact:
Teaches robots to improve at new tasks by practicing.
Vision-Language-Action (VLA) models such as OpenVLA, Octo, and $\pi_0$ have shown strong generalization by leveraging large-scale demonstrations, yet their performance is still fundamentally constrained by the quality and coverage of supervised data. Reinforcement learning (RL) provides a promising path for improving and fine-tuning VLAs through online interaction. However, conventional policy gradient methods are computationally infeasible for flow-matching based models due to the intractability of the importance sampling step, which requires explicit computation of policy ratios. To overcome this limitation, we propose the Flow Policy Optimization (FPO) algorithm, which reformulates importance sampling by leveraging per-sample changes in the conditional flow-matching objective. Furthermore, FPO achieves stable and scalable online reinforcement fine-tuning of the $\pi_0$ model by integrating structure-aware credit assignment to enhance gradient efficiency, clipped surrogate objectives to stabilize optimization, multi-step latent exploration to encourage diverse policy updates, and a Q-ensemble mechanism to provide robust value estimation. We evaluate FPO on the LIBERO benchmark and the ALOHA simulation task against supervised, preference-aligned, diffusion-based, autoregressive online RL, and $\pi_0$-FAST baselines, observing consistent improvements over the imitation prior and strong alternatives, with stable learning under sparse rewards. In addition, ablation studies and analyses of the latent-space dynamics further highlight the contributions of individual components within FPO, validating the effectiveness of the proposed computational modules and the stable convergence of the conditional flow-matching objective during online RL.
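To make the importance-sampling reformulation concrete, below is a minimal PyTorch-style sketch of the idea stated in the abstract: the per-sample conditional flow-matching loss is evaluated under the current and the behavior policy, the change in that loss stands in for the intractable likelihood ratio, and the resulting ratio proxy is plugged into a PPO-style clipped surrogate. The function names, tensor shapes, the `policy(obs, x_t, t)` velocity-field signature, the straight-line interpolant, and the `exp(old_loss - new_loss)` ratio form are illustrative assumptions, not the authors' released implementation; structure-aware credit assignment, multi-step latent exploration, and the Q-ensemble are omitted.

```python
import torch


def per_sample_cfm_loss(policy, obs, actions, noise, t):
    """Per-sample conditional flow-matching loss (assumed straight-line path).

    Shapes (assumed): obs [B, D_o], actions/noise [B, D_a], t [B, 1].
    The policy is assumed to predict the velocity field at the interpolant x_t.
    """
    x_t = (1.0 - t) * noise + t * actions           # point on the noise->action path
    target_v = actions - noise                       # straight-line velocity target
    pred_v = policy(obs, x_t, t)                     # predicted velocity field
    return ((pred_v - target_v) ** 2).mean(dim=-1)   # per-sample loss, shape [B]


def fpo_clipped_loss(policy, old_loss, obs, actions, noise, t, advantages, clip_eps=0.2):
    """Clipped surrogate objective with a flow-matching ratio proxy.

    old_loss is the per-sample CFM loss under the behavior policy, recorded
    at rollout time; advantages come from the critic (not shown here).
    """
    new_loss = per_sample_cfm_loss(policy, obs, actions, noise, t)
    # Ratio proxy: the improvement in the per-sample CFM objective replaces
    # the explicit policy ratio, which is intractable for flow policies.
    ratio = torch.exp(old_loss.detach() - new_loss)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

Because the old loss is evaluated on the same (noise, t) draws, the ratio proxy starts near 1 at the beginning of each update, so the clipping behaves as in standard PPO and keeps the flow policy from drifting too far from the imitation prior in a single step.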
Similar Papers
$π_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models
Machine Learning (CS)
Teaches robots to do more tasks faster.
Balancing Signal and Variance: Adaptive Offline RL Post-Training for VLA Flow Models
Robotics
Robots learn to do tasks better by practicing.
Reinforcing Action Policies by Prophesying
Robotics
Teaches robots to learn new tasks faster.