
Proprioception Enhances Vision Language Model in Generating Captions and Subtask Segmentations for Robot Task

Published: December 24, 2025 | arXiv ID: 2512.20876v1

By: Kanata Suzuki, Shota Shimizu, Tetsuya Ogata

From the perspective of future developments in robotics, it is crucial to verify whether foundation models trained exclusively on offline data, such as images and language, can understand robot motion. In particular, because the training datasets of Vision Language Models (VLMs) do not include low-level motion information from robots, video understanding that involves trajectory information remains a significant challenge. In this study, we assess two capabilities of VLMs through a video captioning task with low-level robot motion information: (1) automatic captioning of robot tasks and (2) segmentation of a series of tasks. Both capabilities are expected to enhance the efficiency of robot imitation learning by linking language and motion, and to serve as a measure of the foundation model's performance. The proposed method generates multiple "scene" captions using image captions and trajectory data from robot tasks. The full task caption is then generated by summarizing these individual captions. Additionally, the method performs subtask segmentation by comparing the similarity between text embeddings of image captions. In both captioning tasks, the proposed method aims to improve performance by providing the robot's motion data (joint and end-effector states) as input to the VLM. Simulator experiments were conducted to validate the effectiveness of the proposed method.
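The abstract does not say how the similarity between caption embeddings is turned into subtask boundaries. As a rough illustration only, the sketch below assumes cosine similarity between consecutive caption embeddings and a fixed threshold. The function names, the threshold value of 0.8, and the toy two-dimensional vectors are all hypothetical and are not taken from the paper; in practice the embeddings would come from a text embedding model applied to each image caption.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_by_caption_similarity(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.8,  # assumed value, not from the paper
) -> List[int]:
    """Return each index i at which a new subtask is assumed to begin.

    A boundary is placed before caption i when its embedding is less
    similar than `threshold` to the embedding of caption i - 1.
    """
    boundaries = []
    for i in range(1, len(embeddings)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold:
            boundaries.append(i)
    return boundaries


# Toy embeddings standing in for caption embeddings (hypothetical data):
# the first two captions are similar, then the content changes.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(segment_by_caption_similarity(toy_embeddings))  # prints [2]
```

A fixed threshold is the simplest possible choice here. Whether the authors use a threshold, a change-point criterion, or some other rule is not stated in the abstract.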

Utilizing Vision-Language Models as Action Models for Intent Recognition and Assistance

Robotics

Robot understands what you want and helps you.

14 Aug 2025 0

91%

Improving Generalization of Language-Conditioned Robot Manipulation

Robotics

Robots learn to move objects with few examples.

4 Aug 2025 1

91%

ExploreVLM: Closed-Loop Robot Exploration Task Planning with Vision-Language Models

Robotics

Robots learn to explore and do tasks better.

16 Aug 2025 0
