ImMimic: Cross-Domain Imitation from Human Videos via Mapping and Interpolation
By: Yangcen Liu, Woo Chul Shin, Yunhai Han, and more
Potential Business Impact:
Teaches robots to do tasks from human videos.
Learning robot manipulation from abundant human videos offers a scalable alternative to costly robot-specific data collection. However, domain gaps across visual, morphological, and physical aspects hinder direct imitation. To bridge this gap effectively, we propose ImMimic, an embodiment-agnostic co-training framework that leverages both human videos and a small number of teleoperated robot demonstrations. ImMimic uses Dynamic Time Warping (DTW) with either action- or visual-based mapping to map retargeted human hand poses to robot joints, followed by MixUp interpolation between paired human and robot trajectories. Our key insights are (1) retargeted human hand trajectories provide informative action labels, and (2) interpolation over the mapped data creates intermediate domains that facilitate smooth domain adaptation during co-training. Evaluations on four real-world manipulation tasks (Pick and Place, Push, Hammer, Flip) across four robotic embodiments (Robotiq, Fin Ray, Allegro, Ability) show that ImMimic improves task success rates and execution smoothness, highlighting its efficacy in bridging the domain gap for robust robot manipulation. The project website can be found at https://sites.google.com/view/immimic.
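To make the mapping-and-interpolation recipe concrete, the sketch below pairs a retargeted human trajectory with a robot joint trajectory via classic DTW and then applies MixUp over the paired states to synthesize an intermediate-domain trajectory. This is not the authors' implementation: the Euclidean distance, the Beta(α, α) mixing coefficient, and the assumption that retargeted human poses already live in the robot's joint space are illustrative choices.

```python
import numpy as np

def dtw_align(human_traj, robot_traj):
    """Classic DTW over two (T, D) trajectories; returns the optimal index pairing."""
    T_h, T_r = len(human_traj), len(robot_traj)
    cost = np.full((T_h + 1, T_r + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T_h + 1):
        for j in range(1, T_r + 1):
            d = np.linalg.norm(human_traj[i - 1] - robot_traj[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from (T_h, T_r) to recover the warping path.
    path, i, j = [], T_h, T_r
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def mixup_paired_states(human_traj, robot_traj, path, alpha=0.5):
    """MixUp over DTW-paired states to create an intermediate-domain trajectory."""
    lam = np.random.beta(alpha, alpha)  # mixing ratio; Beta(alpha, alpha) is an assumption
    mixed = np.stack([lam * human_traj[i] + (1.0 - lam) * robot_traj[j] for i, j in path])
    return mixed, lam

# Toy usage: retargeted human poses and robot joints in the same D-dimensional action space.
human = np.random.randn(60, 7)   # hypothetical retargeted human hand trajectory
robot = np.random.randn(45, 7)   # hypothetical teleoperated robot trajectory
pairs = dtw_align(human, robot)
intermediate, lam = mixup_paired_states(human, robot, pairs)
```

In the co-training setting described in the abstract, such mixed trajectories would be sampled at varying mixing ratios so the policy sees a smooth continuum between the human and robot domains rather than two disjoint datasets.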
Similar Papers
MimicDroid: In-Context Learning for Humanoid Robot Manipulation from Human Play Videos
Robotics
Robots learn new tasks by watching people play.
From Generated Human Videos to Physically Plausible Robot Trajectories
Robotics
Robots copy human moves from AI-generated videos.
UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations
Robotics
Robots learn new skills by watching humans.