VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation
By: Yixiang Chen, Yan Huang, Keji He, and more
When performing 3D manipulation tasks, robots must plan actions from perceptions captured by multiple fixed cameras. This multi-camera setup introduces substantial redundancy and irrelevant information, which increases computational cost and forces the model to spend extra training time extracting crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose VERM (Virtual Eye for Robotic Manipulation), which leverages the knowledge in foundation models to imagine a virtual, task-adaptive view from a constructed 3D point cloud, efficiently capturing the necessary information and mitigating occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experiments on the RLBench simulation benchmark and in real-world evaluations demonstrate the effectiveness of our method, which surpasses previous state-of-the-art methods while achieving a 1.89x speedup in training and a 1.54x speedup in inference. More results are available on our project website: https://verm-ral.github.io
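The core geometric idea in the abstract, rendering a single task-adaptive virtual view from a fused 3D point cloud, can be pictured with a minimal sketch. The code below is an illustrative assumption, not the paper's foundation-model-based pipeline: the function name `render_virtual_view`, the pinhole intrinsics, and the z-buffer splatting are all hypothetical example choices used only to show how a virtual camera pose turns a point cloud into one occlusion-aware image.

```python
# Minimal sketch (assumed, not the authors' implementation): project a colored
# point cloud into a hypothetical "virtual eye" -- a pinhole camera whose pose
# could be chosen per task -- and keep the nearest point per pixel (z-buffer),
# which is what lets a well-placed virtual view mitigate occlusion.
import numpy as np

def render_virtual_view(points_xyz, colors_rgb, K, T_world_to_cam, hw=(128, 128)):
    """Z-buffer splat of a world-frame point cloud into a virtual camera image.

    points_xyz     : (N, 3) points in the world frame
    colors_rgb     : (N, 3) per-point colors in [0, 1]
    K              : (3, 3) pinhole intrinsics of the virtual camera
    T_world_to_cam : (4, 4) world-to-camera extrinsics
    hw             : (H, W) output resolution
    """
    H, W = hw
    # Transform points into the virtual camera frame.
    pts_h = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)
    cam = (T_world_to_cam @ pts_h.T).T[:, :3]
    in_front = cam[:, 2] > 1e-6          # discard points behind the camera
    cam, colors = cam[in_front], colors_rgb[in_front]

    # Pinhole projection to pixel coordinates.
    uvw = (K @ cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]
    u = uv[:, 0].round().astype(int)
    v = uv[:, 1].round().astype(int)
    z = cam[:, 2]

    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, z, colors = u[valid], v[valid], z[valid], colors[valid]

    # Z-buffering: the nearest point wins each pixel.
    image = np.zeros((H, W, 3))
    depth = np.full((H, W), np.inf)
    for ui, vi, zi, ci in zip(u, v, z, colors):
        if zi < depth[vi, ui]:
            depth[vi, ui] = zi
            image[vi, ui] = ci
    return image, depth

if __name__ == "__main__":
    # Toy usage: a random cloud viewed by a virtual camera at the world origin.
    rng = np.random.default_rng(0)
    pts = rng.uniform([-0.5, -0.5, 0.5], [0.5, 0.5, 1.5], size=(5000, 3))
    cols = rng.uniform(0.0, 1.0, size=(5000, 3))
    K = np.array([[100.0, 0.0, 64.0], [0.0, 100.0, 64.0], [0.0, 0.0, 1.0]])
    img, dep = render_virtual_view(pts, cols, K, np.eye(4))
```

In VERM the virtual view is additionally informed by foundation-model knowledge so that the viewpoint adapts to the task; the sketch above covers only the purely geometric projection step.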
Similar Papers
Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception (Robotics). Robotic eye learns to look and zoom for details.
VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing (Robotics). Helps robots learn tasks by picking the best "eyes."
3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks (Robotics). Robots learn to grab objects better with 3D vision.