Rethinking Intermediate Representation for VLM-based Robot Manipulation
By: Weiliang Tang, Jialin Gao, Jia-Hui Pan, and more
Potential Business Impact:
Helps robots understand and do new tasks.
Vision-Language Models (VLMs) are an important component for enabling robust robot manipulation. Yet, using them to translate human instructions into an action-resolvable intermediate representation often requires a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammar, we design the Semantic Assembly representation, named SEAM, by decomposing the intermediate representation into a vocabulary and a grammar. Doing so leads to a concise vocabulary of semantically rich operations and a VLM-friendly grammar for handling diverse unseen tasks. In addition, we design a new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy to localize fine-grained object parts for manipulation, achieving the shortest inference time among state-of-the-art parallel works. We also formulate new metrics for action-generalizability and VLM-comprehensibility, demonstrating the compelling performance of SEAM over mainstream representations on both aspects. Extensive real-world experiments further demonstrate its state-of-the-art performance across varying settings and tasks.
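To make the vocabulary-plus-grammar idea concrete, here is a minimal, hypothetical Python sketch of what a context-free-grammar-style intermediate representation could look like. The operation names (grasp, move_to, release, rotate), the grammar productions, and the parser below are illustrative assumptions for exposition only; they are not the actual SEAM vocabulary or grammar described in the paper.

# Hypothetical sketch of a vocabulary + grammar intermediate representation.
# The operation names and grammar rules are assumptions, not the paper's definitions.

from dataclasses import dataclass
from typing import List

# "Vocabulary": a small set of semantically rich operations (hypothetical names).
VOCABULARY = {"grasp", "move_to", "release", "rotate"}

@dataclass
class Op:
    name: str    # operation drawn from the vocabulary, e.g. "grasp"
    target: str  # object or object-part phrase, e.g. "mug handle"

# "Grammar": a program is an ordered sequence of operations, roughly
#   Program -> Op Program | Op
#   Op      -> NAME '(' TARGET ')'
# Keeping the productions this simple is meant to make the representation
# easy for a VLM to emit reliably.
def parse_program(text: str) -> List[Op]:
    """Parse lines like 'grasp(mug handle)' into Op objects,
    rejecting any operation name outside the vocabulary."""
    ops = []
    for line in text.strip().splitlines():
        name, _, rest = line.partition("(")
        name = name.strip()
        target = rest.rstrip(")").strip()
        if name not in VOCABULARY:
            raise ValueError(f"unknown operation: {name!r}")
        ops.append(Op(name, target))
    return ops

if __name__ == "__main__":
    # A VLM-emitted program for "put the mug on the shelf" (illustrative only).
    program = """
    grasp(mug handle)
    move_to(shelf)
    release(mug)
    """
    for op in parse_program(program):
        print(op)

In this sketch, generalization to unseen tasks comes from composing the fixed vocabulary under the grammar rather than enumerating task-specific representations, which mirrors the decomposition the abstract describes at a high level.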
Similar Papers
VLM-driven Skill Selection for Robotic Assembly Tasks
Robotics
Robot builds things by watching and listening.
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
Robotics
Robots learn to do tasks by watching and listening.
Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation
Robotics
Robots learn to do tricky jobs with speed and accuracy.