MV-UMI: A Scalable Multi-View Interface for Cross-Embodiment Learning

Published: September 23, 2025 | arXiv ID: 2509.18757v1

By: Omar Rayyan, John Abanes, Mahmoud Hafez, and more

Potential Business Impact:

Robots learn manipulation tasks more reliably when demonstrations include additional camera views.

Business Areas:
Motion Capture, Media and Entertainment, Video

Recent advances in imitation learning have shown great promise for developing robust robot manipulation policies from demonstrations. However, this promise is contingent on the availability of diverse, high-quality datasets, which are not only challenging and costly to collect but are often constrained to a specific robot embodiment. Portable handheld grippers have recently emerged as intuitive and scalable alternatives to traditional robotic teleoperation for data collection. However, their sole reliance on first-person, wrist-mounted cameras often limits how much scene context they capture. In this paper, we present MV-UMI (Multi-View Universal Manipulation Interface), a framework that integrates a third-person perspective with the egocentric camera to overcome this limitation. This integration mitigates the domain shift between human demonstration and robot deployment while preserving the cross-embodiment advantages of handheld data-collection devices. Our experimental results, including an ablation study, demonstrate that MV-UMI improves performance on sub-tasks requiring broad scene understanding by approximately 47% across three tasks, confirming that our approach expands the range of manipulation tasks that can be learned with handheld gripper systems without compromising their inherent cross-embodiment advantages.
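As a rough illustration of the multi-view idea described in the abstract, the sketch below shows one way a policy could consume both an egocentric wrist-camera frame and a third-person scene frame. This is not the authors' architecture: the CNN encoders, feature dimensions, action dimension, and PyTorch framing are all assumptions made for the example.

```python
# Minimal sketch (assumed design, not the MV-UMI implementation): encode each
# camera view separately, concatenate the features, and predict an action.
import torch
import torch.nn as nn


class ViewEncoder(nn.Module):
    """Small CNN mapping one RGB view to a fixed-size feature vector."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.backbone(img)


class MultiViewPolicy(nn.Module):
    """Fuses wrist-camera and third-person features before the action head."""

    def __init__(self, action_dim: int = 7, feat_dim: int = 128):
        super().__init__()
        self.wrist_enc = ViewEncoder(feat_dim)     # egocentric (first-person) view
        self.external_enc = ViewEncoder(feat_dim)  # third-person scene view
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, wrist_img: torch.Tensor, external_img: torch.Tensor) -> torch.Tensor:
        feats = torch.cat(
            [self.wrist_enc(wrist_img), self.external_enc(external_img)], dim=-1
        )
        return self.head(feats)


if __name__ == "__main__":
    policy = MultiViewPolicy()
    wrist = torch.randn(2, 3, 96, 96)     # batch of wrist-camera frames
    external = torch.randn(2, 3, 96, 96)  # batch of third-person frames
    print(policy(wrist, external).shape)  # torch.Size([2, 7])
```

The key point the example captures is simply that the third-person stream gives the policy scene context the wrist camera alone cannot provide; how the views are encoded and fused in the actual paper may differ.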

Country of Origin
🇺🇸 United States

Page Count
8 pages

Category
Computer Science:
Robotics