Robotic Assistant: Completing Collaborative Tasks with Dexterous Vision-Language-Action Models
By: Boshi An, Chenyu Yang, Robert Katzschmann
Potential Business Impact:
Robots learn to help people with simple words.
We adapt a pre-trained Vision-Language-Action (VLA) model (OpenVLA) for dexterous human-robot collaboration with minimal language prompting. Our approach adds (i) FiLM conditioning on the visual backbones for task-aware perception, (ii) an auxiliary intent head that predicts the collaborator's hand pose and target cues, and (iii) action-space post-processing in which the policy predicts compact deltas (position/rotation) and PCA-reduced finger joints before mapping back to full commands. Using a multi-view, teleoperated Franka and Mimic-hand dataset augmented with MediaPipe hand poses, we show that delta actions are well-behaved and that four principal components explain ~96% of hand-joint variance. Ablations identify action post-processing as the primary performance driver; the auxiliary intent head helps, FiLM is mixed, and a directional motion loss is detrimental. A real-time stack (~0.3 s latency on one RTX 4090) composes "pick-up" and "pass" into a long-horizon behavior. We identify "trainer overfitting" to specific demonstrators as the key limitation.
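To make the action-space post-processing concrete, here is a minimal sketch (not the authors' code) of predicting compact deltas plus a PCA-reduced finger-joint code and mapping it back to full commands. It assumes demonstrations provide hand-joint commands as a NumPy array and uses scikit-learn's PCA; all function and variable names are illustrative.

```python
# Sketch of delta + PCA action post-processing, under the assumptions above.
import numpy as np
from sklearn.decomposition import PCA


def fit_hand_pca(hand_joint_demos: np.ndarray, n_components: int = 4) -> PCA:
    """Fit a PCA basis on demonstrated hand-joint vectors (num_steps, num_joints).

    The paper reports ~96% of hand-joint variance explained with 4 components.
    """
    pca = PCA(n_components=n_components)
    pca.fit(hand_joint_demos)
    print("explained variance:", pca.explained_variance_ratio_.sum())
    return pca


def encode_action(ee_pos, ee_rot, hand_joints, prev_pos, prev_rot, pca: PCA):
    """Build the compact training target: position/rotation deltas plus
    PCA coefficients of the finger joints."""
    delta_pos = ee_pos - prev_pos                       # 3-D end-effector position delta
    delta_rot = ee_rot - prev_rot                       # rotation delta (e.g. axis-angle)
    hand_coeffs = pca.transform(hand_joints[None])[0]   # low-dimensional hand code
    return np.concatenate([delta_pos, delta_rot, hand_coeffs])


def decode_action(action, prev_pos, prev_rot, pca: PCA):
    """Map a compact prediction back to full robot commands."""
    delta_pos, delta_rot, hand_coeffs = action[:3], action[3:6], action[6:]
    ee_pos = prev_pos + delta_pos
    ee_rot = prev_rot + delta_rot
    hand_joints = pca.inverse_transform(hand_coeffs[None])[0]
    return ee_pos, ee_rot, hand_joints
```

The design intuition is that deltas and a low-dimensional hand code give the VLA a smaller, better-conditioned output space than absolute full-joint commands, which the ablations suggest is the main source of the performance gain.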
Similar Papers
Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos
Robotics
Teaches robots to do tasks by watching people.
Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control
Robotics
Robots learn to build things by watching videos.
Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation
Robotics
Robots learn to do tasks better by watching and listening.