Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision
By: Tomoya Yoshida, Shuhei Kurita, Taichi Nishimura, and more
Potential Business Impact:
Robots learn to use tools by watching videos.
Learning to use tools or objects in everyday scenes, and in particular handling them in various ways as instructed, is a key challenge for developing interactive robots. Training models to generate such manipulation trajectories requires a large and diverse collection of detailed manipulation demonstrations for many objects, which is nearly infeasible to gather at scale. In this paper, we propose a framework that leverages the large-scale egocentric and exocentric video dataset Ego-Exo4D, constructed globally with substantial effort, to extract diverse manipulation trajectories at scale. From these extracted trajectories and their associated textual action descriptions, we develop trajectory generation models based on visual and point cloud-based language models. On HOT3D, a recently proposed egocentric vision dataset with high-quality trajectories, we confirm that our models generate valid object trajectories, establishing a training dataset and baseline models for the novel task of generating 6DoF manipulation trajectories from action descriptions in egocentric vision.
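To make the task concrete, here is a minimal, hypothetical sketch of how a 6DoF manipulation trajectory paired with an action description might be represented as data. The class and field names (Pose6DoF, ManipulationSample, object_point_cloud, etc.) are illustrative assumptions for this summary, not the paper's actual interface or code.

```python
# Hypothetical sketch: a 6DoF object-manipulation trajectory (per-step translation
# plus rotation) paired with a textual action description and an initial point cloud.
# All names and fields are illustrative assumptions, not the paper's API.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Pose6DoF:
    """One trajectory step: 3D translation plus a unit-quaternion rotation."""
    translation: np.ndarray  # shape (3,), object position in the camera/world frame
    rotation: np.ndarray     # shape (4,), unit quaternion (w, x, y, z)


@dataclass
class ManipulationSample:
    """A training example: action text, initial observation, and the target trajectory."""
    action_description: str          # e.g. "pick up the mug and pour water"
    object_point_cloud: np.ndarray   # shape (N, 3), object points observed at the start
    trajectory: List[Pose6DoF]       # future object poses the model should generate


def trajectory_to_array(traj: List[Pose6DoF]) -> np.ndarray:
    """Flatten a trajectory into a (T, 7) array (x, y, z, qw, qx, qy, qz) for a sequence model."""
    return np.stack([np.concatenate([p.translation, p.rotation]) for p in traj])


if __name__ == "__main__":
    # Toy example: a 10-step vertical lift of an object by 1 cm per step, no rotation.
    steps = [Pose6DoF(np.array([0.0, 0.0, 0.01 * t]),
                      np.array([1.0, 0.0, 0.0, 0.0])) for t in range(10)]
    sample = ManipulationSample("lift the cup", np.zeros((1024, 3)), steps)
    print(trajectory_to_array(sample.trajectory).shape)  # (10, 7)
```

Under this representation, a generation model conditioned on the action text and the initial observation would output the (T, 7) pose sequence; the paper's models are described as building on visual and point cloud-based language models for this step.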
Similar Papers
6D Object Pose Tracking in Internet Videos for Robotic Manipulation
CV and Pattern Recognition
Robots copy object moves from online videos.
Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control
CV and Pattern Recognition
Robots learn to move objects together better.
Object-centric 3D Motion Field for Robot Learning from Human Videos
Robotics
Robots learn to do tasks by watching videos.