Improving Generalization of Language-Conditioned Robot Manipulation
By: Chenglin Cui, Chaoran Zhu, Changjae Oh and more
Potential Business Impact:
Robots learn to move objects with few examples.
The control of robots for manipulation tasks generally relies on visual input. Recent advances in vision-language models (VLMs) enable natural language instructions to condition visual input and control robots across a wider range of environments. However, existing methods require a large amount of data to fine-tune VLMs for operating in unseen environments. In this paper, we present a framework that learns object-arrangement tasks from just a few demonstrations. We propose a two-stage framework that divides object-arrangement tasks into a target localization stage for picking the object and a region determination stage for placing the object. We present an instance-level semantic fusion module that aligns instance-level image crops with the text embedding, enabling the model to identify the target objects specified by the natural language instructions. We validate our method in both simulated and real-world robotic environments. Our method, fine-tuned with a few demonstrations, improves generalization capability and demonstrates zero-shot ability in real-robot manipulation scenarios.
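The abstract describes aligning instance-level image crops with a text embedding to locate the target object. Below is a minimal, hypothetical sketch of that crop-vs-instruction alignment idea using an off-the-shelf CLIP model from Hugging Face `transformers`; the function name `select_target_crop`, the checkpoint, and the scoring scheme are illustrative assumptions, not the paper's actual fusion module.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative sketch (not the paper's method): score instance-level image
# crops against a language instruction with a pretrained CLIP model, then
# pick the crop whose embedding best matches the instruction.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_target_crop(crops, instruction):
    """Return the index of the crop most similar to the instruction text."""
    inputs = processor(text=[instruction], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # L2-normalize embeddings and compute cosine similarity per crop.
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    scores = image_emb @ text_emb.T  # shape: (num_crops, 1)
    return int(scores.squeeze(-1).argmax())

# Usage (crops would come from an upstream instance detector/segmenter):
# crops = [Image.open("crop_0.png"), Image.open("crop_1.png")]
# idx = select_target_crop(crops, "pick up the red mug")
```

In the paper's two-stage framing, a score like this would inform the target localization stage (which object to pick), while a separate module determines the placement region; how the fusion and fine-tuning are actually implemented is specific to the paper.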
Similar Papers
LaVA-Man: Learning Visual Action Representations for Robot Manipulation
Robotics
Robots learn to grab things by looking and reading.
Generalist Robot Manipulation beyond Action Labeled Data
Robotics
Robots learn new tasks from watching videos.
VLM-driven Skill Selection for Robotic Assembly Tasks
Robotics
Robot builds things by watching and listening.