LaVA-Man: Learning Visual Action Representations for Robot Manipulation
By: Chaoran Zhu, Hengyi Wang, Yik Lung Pang, and more
Potential Business Impact:
Robots learn to grab things by looking and reading.
Visual-textual understanding is essential for language-guided robot manipulation. Recent works leverage pre-trained vision-language models to measure the similarity between encoded visual observations and textual instructions, and then train a model to map this similarity to robot actions. However, this two-step approach limits the model's ability to capture the relationship between visual observations and textual instructions, leading to reduced precision in manipulation tasks. We propose to learn visual-textual associations through a self-supervised pretext task: reconstructing a masked goal image conditioned on an input image and textual instructions. This formulation allows the model to learn visual-action representations without robot action supervision. The learned representations can then be fine-tuned for manipulation tasks with only a few demonstrations. We also introduce the Omni-Object Pick-and-Place dataset, which consists of annotated robot tabletop manipulation episodes covering 180 object classes and 3,200 instances with corresponding textual instructions. This dataset enables the model to acquire diverse object priors and allows for a more comprehensive evaluation of its generalisation capability across object instances. Experimental results on five benchmarks, including both simulated and real-robot validations, demonstrate that our method outperforms prior art.
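To make the pretext task concrete, below is a minimal sketch of masked goal-image reconstruction conditioned on an observation image and a text embedding. The module names, dimensions, patch size, and the choice of a plain transformer decoder are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the self-supervised pretext task: reconstruct masked patches of a
# goal image, conditioned on the current observation and a textual instruction.
# All architecture choices here (patch size, dims, layer counts) are assumptions.
import torch
import torch.nn as nn


class MaskedGoalReconstructor(nn.Module):
    def __init__(self, img_dim=768, txt_dim=512, patch=16, img_size=224):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, img_dim, kernel_size=patch, stride=patch)
        self.txt_proj = nn.Linear(txt_dim, img_dim)          # project text embedding into token space
        self.mask_token = nn.Parameter(torch.zeros(1, 1, img_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, img_dim))
        layer = nn.TransformerEncoderLayer(d_model=img_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(img_dim, patch * patch * 3)    # predict RGB pixels for each goal patch

    def forward(self, obs_img, goal_img, txt_emb, mask):
        # obs_img, goal_img: (B, 3, H, W); txt_emb: (B, txt_dim); mask: (B, N) bool, True = masked patch
        obs_tok = self.patch_embed(obs_img).flatten(2).transpose(1, 2)    # (B, N, D)
        goal_tok = self.patch_embed(goal_img).flatten(2).transpose(1, 2)  # (B, N, D)
        # Replace masked goal patches with a learnable mask token
        goal_tok = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(goal_tok), goal_tok)
        txt_tok = self.txt_proj(txt_emb).unsqueeze(1)                     # (B, 1, D)
        tokens = torch.cat([txt_tok, obs_tok, goal_tok + self.pos_embed], dim=1)
        decoded = self.decoder(tokens)[:, -goal_tok.shape[1]:]            # keep goal tokens only
        return self.head(decoded)                                         # (B, N, patch*patch*3)
```

In this sketch, pre-training would minimise a reconstruction loss (e.g. MSE on the masked patches against the true goal image), requiring no robot action labels; the resulting representation could then be fine-tuned with a small action head on a few demonstrations, as the abstract describes.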
Similar Papers
Improving Generalization of Language-Conditioned Robot Manipulation
Robotics
Robots learn to move objects with few examples.
See Once, Then Act: Vision-Language-Action Model with Task Learning from One-Shot Video Demonstrations
Robotics
Robots learn new tasks from just one video.
Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos
Robotics
Teaches robots to do tasks by watching people.