Proprioception Enhances Vision Language Model in Generating Captions and Subtask Segmentations for Robot Task
By: Kanata Suzuki, Shota Shimizu, Tetsuya Ogata
From the perspective of future developments in robotics, it is crucial to verify whether foundation models trained exclusively on offline data, such as images and language, can understand robot motion. In particular, since Vision Language Models (VLMs) do not include low-level robot motion information in their training datasets, video understanding that incorporates trajectory information remains a significant challenge. In this study, we assess two capabilities of VLMs through a video captioning task with low-level robot motion information: (1) automatic captioning of robot tasks and (2) segmentation of a series of tasks. Both capabilities are expected to enhance the efficiency of robot imitation learning by linking language and motion, and to serve as a measure of the foundation model's performance. The proposed method generates multiple "scene" captions using image captions and trajectory data from robot tasks. The full task caption is then generated by summarizing these individual captions. Additionally, the method performs subtask segmentation by comparing the similarity between text embeddings of image captions. In both captioning tasks, the proposed method aims to improve performance by providing the robot's motion data (joint and end-effector states) as input to the VLM. Simulator experiments were conducted to validate the effectiveness of the proposed method.
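The abstract outlines a pipeline of per-scene captioning conditioned on proprioception followed by embedding-based subtask segmentation. The sketch below illustrates one way such a pipeline could be wired together; it is not the authors' implementation. The functions query_vlm and embed_text are hypothetical placeholders for a multimodal model and a text encoder, and the use of cosine similarity with a 0.8 boundary threshold is an assumption, since the paper only states that caption embeddings are compared for similarity.

```python
# Illustrative sketch (assumptions noted above), not the paper's actual code.
from typing import List, Tuple
import numpy as np


def query_vlm(image, proprioception: List[float], prompt: str) -> str:
    """Hypothetical VLM call: caption one frame, with the robot's joint and
    end-effector states appended to the text prompt as extra context."""
    raise NotImplementedError  # e.g. a request to a multimodal model API


def embed_text(caption: str) -> np.ndarray:
    """Hypothetical sentence-embedding call for a generated caption."""
    raise NotImplementedError


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def segment_subtasks(frames, states, threshold: float = 0.8) -> Tuple[List[str], List[int]]:
    """Caption each sampled frame with proprioceptive context, then place a
    subtask boundary wherever consecutive caption embeddings diverge."""
    captions = [
        query_vlm(img, s, "Describe what the robot is doing in this scene.")
        for img, s in zip(frames, states)
    ]
    embeddings = [embed_text(c) for c in captions]
    boundaries = [
        i for i in range(1, len(embeddings))
        if cosine(embeddings[i - 1], embeddings[i]) < threshold
    ]
    return captions, boundaries
```

In this reading, the full task caption would then be produced by passing the list of scene captions back to the language model with a summarization prompt.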
Similar Papers
Utilizing Vision-Language Models as Action Models for Intent Recognition and Assistance (Robotics): A robot recognizes what the user wants and provides assistance.
Improving Generalization of Language-Conditioned Robot Manipulation (Robotics): Robots learn to move objects from few examples.
ExploreVLM: Closed-Loop Robot Exploration Task Planning with Vision-Language Models (Robotics): Robots learn to explore and perform tasks more effectively.