Imagination at Inference: Synthesizing In-Hand Views for Robust Visuomotor Policy Inference
By: Haoran Ding, Anqing Duan, Zezhou Sun and more
Potential Business Impact:
Robots can "see" better without extra cameras.
Visual observations from different viewpoints can significantly influence the performance of visuomotor policies in robotic manipulation. Among these, egocentric (in-hand) views often provide crucial information for precise control. However, in some applications, equipping robots with dedicated in-hand cameras may pose challenges due to hardware constraints, system complexity, and cost. In this work, we propose to endow robots with imaginative perception, enabling them to 'imagine' in-hand observations from agent views at inference time. We achieve this via novel view synthesis (NVS), leveraging a fine-tuned diffusion model conditioned on the relative pose between the agent-view and in-hand cameras. Specifically, we apply LoRA-based fine-tuning to adapt a pretrained NVS model (ZeroNVS) to the robotic manipulation domain. We evaluate our approach on simulation benchmarks (RoboMimic and MimicGen) and in real-world experiments using a Unitree Z1 robotic arm on a strawberry-picking task. Results show that synthesized in-hand views significantly enhance policy inference, effectively recovering the performance lost when real in-hand cameras are absent. Our method offers a scalable and hardware-light solution for deploying robust visuomotor policies, highlighting the potential of imaginative visual reasoning in embodied agents.
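The abstract says the diffusion model is conditioned on the relative pose between the agent-view and in-hand cameras, but does not say how that pose is represented. The snippet below is a minimal sketch of one common way to obtain such a quantity, assuming each camera pose is available as a 4x4 camera-to-world rigid transform. The function names, frame conventions, and example values are illustrative assumptions, not the authors' implementation or the ZeroNVS interface.

```python
import numpy as np


def make_pose(rotation, translation):
    """Build a 4x4 camera-to-world transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def invert_pose(pose):
    """Invert a rigid transform using R^T instead of a general matrix inverse."""
    rotation = pose[:3, :3]
    translation = pose[:3, 3]
    inverse = np.eye(4)
    inverse[:3, :3] = rotation.T
    inverse[:3, 3] = -rotation.T @ translation
    return inverse


def relative_pose(world_from_agent, world_from_hand):
    """Return hand_from_agent, mapping agent-camera coordinates to in-hand coordinates."""
    return invert_pose(world_from_hand) @ world_from_agent


def rot_z(angle):
    """Rotation about the z-axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Hypothetical camera placements, chosen only to exercise the functions.
world_from_agent = make_pose(np.eye(3), np.array([1.0, 0.0, 0.5]))
world_from_hand = make_pose(rot_z(np.pi / 2), np.array([0.2, 0.1, 0.3]))

hand_from_agent = relative_pose(world_from_agent, world_from_hand)
```

In a setup like the one described, the in-hand camera pose would typically come from the arm's forward kinematics, so this relative pose could be recomputed at every control step without a physical in-hand camera. That use is an assumption about the pipeline, not a detail stated in the abstract.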
Similar Papers
NVSPolicy: Adaptive Novel-View Synthesis for Generalizable Language-Conditioned Policy Learning
Robotics
Robots see more, learn faster, and do tasks better.
A Generative System for Robot-to-Human Handovers: from Intent Inference to Spatial Configuration Imagery
Robotics
Robots learn to hand things to people smoothly.
MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment
Robotics
Makes robots learn long, complex tasks from videos.