CO-RFT: Efficient Fine-Tuning of Vision-Language-Action Models through Chunked Offline Reinforcement Learning

Published: August 4, 2025 | arXiv ID: 2508.02219v1

By: Dongchi Huang, Zhirui Fang, Tianle Zhang and more

Potential Business Impact:

Teaches robots new tasks with just a few examples.

Vision-Language-Action (VLA) models demonstrate significant potential for developing generalized policies in real-world robotic control. This progress inspires researchers to explore fine-tuning these models with Reinforcement Learning (RL). However, fine-tuning VLA models with RL still faces challenges related to sample efficiency, compatibility with action chunking, and training stability. To address these challenges, we explore the fine-tuning of VLA models through offline reinforcement learning incorporating action chunking. In this work, we propose Chunked RL, a novel reinforcement learning framework specifically designed for VLA models. Within this framework, we extend temporal difference (TD) learning to incorporate action chunking, a prominent characteristic of VLA models. Building upon this framework, we propose CO-RFT, an algorithm aimed at fine-tuning VLA models using a limited set of demonstrations (30 to 60 samples). Specifically, we first conduct imitation learning (IL) with full parameter fine-tuning to initialize both the backbone and the policy. Subsequently, we implement offline RL with action chunking to optimize the pretrained policy. Our empirical results in real-world environments demonstrate that CO-RFT outperforms previous supervised methods, achieving a 57% improvement in success rate and a 22.3% reduction in cycle time. Moreover, our method exhibits robust positional generalization capabilities, attaining a success rate of 44.3% in previously unseen positions.
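The abstract does not spell out how TD learning is extended to action chunks. As a rough illustration only, one common way to form a chunk-level TD target is to sum the discounted rewards collected while a chunk of h actions executes, then bootstrap from the value of the next chunk. The sketch below assumes that formulation; the function name `chunked_td_target`, its parameters, and the discount value are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def chunked_td_target(
    rewards: Sequence[float],
    bootstrap_q: float,
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Chunk-level TD target (illustrative assumption, not the paper's exact rule).

    rewards:     per-step rewards r_t ... r_{t+h-1} observed while one
                 action chunk of length h is executed.
    bootstrap_q: critic estimate Q(s_{t+h}, next_chunk) for the state
                 reached after the chunk finishes.
    gamma:       per-step discount factor.
    done:        True if the episode ended inside this chunk, in which
                 case no bootstrap term is added.
    """
    # Discounted sum of the rewards collected during the chunk.
    target = 0.0
    for i, r in enumerate(rewards):
        target += (gamma ** i) * r

    # Bootstrap from the next chunk's value, discounted by the chunk length.
    if not done:
        target += (gamma ** len(rewards)) * bootstrap_q
    return target


if __name__ == "__main__":
    # A chunk of two steps, each with reward 1, followed by a state whose
    # next-chunk value is estimated at 10.
    print(chunked_td_target([1.0, 1.0], bootstrap_q=10.0, gamma=0.5))
```

Under this assumption, the critic is updated once per chunk rather than once per primitive action, which is what makes the target compatible with a policy that emits several actions at a time.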

Dual-Actor Fine-Tuning of VLA Models: A Talk-and-Tweak Human-in-the-Loop Approach

Robotics

Teaches robots new jobs by talking to them.

17 Sep 2025


Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding

Robotics

Robots learn tasks faster and better.

4 Mar 2025


SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Robotics

Robots learn to do new tasks better with less data.

11 Sep 2025

Country of Origin

🇨🇳 China

Page Count

9 pages
