Rethinking Intermediate Representation for VLM-based Robot Manipulation
By: Weiliang Tang, Jialin Gao, Jia-Hui Pan, and more
Potential Business Impact:
Helps robots understand and do new tasks.
Vision-Language Models (VLMs) are an important component for enabling robust robot manipulation. Yet, using them to translate human instructions into an action-resolvable intermediate representation often requires a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammar, we design the Semantic Assembly representation, named SEAM, by decomposing the intermediate representation into a vocabulary and a grammar. Doing so leads to a concise vocabulary of semantically rich operations and a VLM-friendly grammar for handling diverse unseen tasks. In addition, we design a new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy to localize fine-grained object parts for manipulation, achieving the shortest inference time among state-of-the-art parallel works. We also formulate new metrics for action-generalizability and VLM-comprehensibility, demonstrating the compelling performance of SEAM over mainstream representations on both aspects. Extensive real-world experiments further demonstrate its state-of-the-art performance across varying settings and tasks.
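To make the vocabulary-plus-grammar idea concrete, here is a minimal, hypothetical Python sketch of what a context-free-grammar-style intermediate representation could look like. The operation names (grasp, move_to, release, rotate), the grammar productions, and the parser below are illustrative assumptions for exposition only; they are not the actual SEAM vocabulary or grammar described in the paper.

# Hypothetical sketch of a vocabulary + grammar intermediate representation.
# The operation names and grammar rules are assumptions, not the paper's definitions.

from dataclasses import dataclass
from typing import List

# "Vocabulary": a small set of semantically rich operations (hypothetical names).
VOCABULARY = {"grasp", "move_to", "release", "rotate"}

@dataclass
class Op:
    name: str    # operation drawn from the vocabulary, e.g. "grasp"
    target: str  # object or object-part phrase, e.g. "mug handle"

# "Grammar": a program is an ordered sequence of operations, roughly
#   Program -> Op Program | Op
#   Op      -> NAME '(' TARGET ')'
# Keeping the productions this simple is meant to make the representation
# easy for a VLM to emit reliably.
def parse_program(text: str) -> List[Op]:
    """Parse lines like 'grasp(mug handle)' into Op objects,
    rejecting any operation name outside the vocabulary."""
    ops = []
    for line in text.strip().splitlines():
        name, _, rest = line.partition("(")
        name = name.strip()
        target = rest.rstrip(")").strip()
        if name not in VOCABULARY:
            raise ValueError(f"unknown operation: {name!r}")
        ops.append(Op(name, target))
    return ops

if __name__ == "__main__":
    # A VLM-emitted program for "put the mug on the shelf" (illustrative only).
    program = """
    grasp(mug handle)
    move_to(shelf)
    release(mug)
    """
    for op in parse_program(program):
        print(op)

In this sketch, generalization to unseen tasks comes from composing the fixed vocabulary under the grammar rather than enumerating task-specific representations, which mirrors the decomposition the abstract describes at a high level.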
Similar Papers
VLM-driven Skill Selection for Robotic Assembly Tasks
Robotics
Robot builds things by watching and listening.
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
Robotics
Robots learn to do tasks by watching and listening.
Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation
Robotics
Robots learn to do tricky jobs with speed and accuracy.