RoboRetriever: Single-Camera Robot Object Retrieval via Active and Interactive Perception with Dynamic Scene Graph
By: Hecheng Wang, Jiankun Ren, Jia Yu, and more
Potential Business Impact:
Robot finds things using one camera and words.
Humans effortlessly retrieve objects in cluttered, partially observable environments by combining visual reasoning, active viewpoint adjustment, and physical interaction, all with only a single pair of eyes. In contrast, most existing robotic systems rely on carefully positioned fixed or multi-camera setups with complete scene visibility, which limits adaptability and incurs high hardware costs. We present RoboRetriever, a novel framework for real-world object retrieval that operates using only a single wrist-mounted RGB-D camera and free-form natural language instructions. RoboRetriever grounds visual observations to build and update a dynamic hierarchical scene graph that encodes object semantics, geometry, and inter-object relations over time. The supervisor module reasons over this memory and the task instruction to infer the target object and to coordinate an integrated action module that combines active perception, interactive perception, and manipulation. To enable task-aware, scene-grounded active perception, we introduce a novel visual prompting scheme that leverages large reasoning vision-language models to determine 6-DoF camera poses aligned with the semantic task goal and the geometric scene context. We evaluate RoboRetriever on diverse real-world object retrieval tasks, including scenarios with human intervention, demonstrating strong adaptability and robustness in cluttered scenes with only one RGB-D camera.
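To make the idea of a scene graph that is updated over time more concrete, here is a minimal Python sketch. It is not the authors' implementation: the class names (`ObjectNode`, `SceneGraph`), the `update` and `blockers` methods, the `"on"` predicate, and the coordinates are all hypothetical, and the hierarchy described in the abstract is left out for brevity. It only illustrates how a memory of objects and relations could be refreshed after each new observation and then queried to decide whether something must be moved before the target can be reached.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    name: str        # semantic label, e.g. "box" (hypothetical)
    position: tuple  # (x, y, z) centroid; frame and units are assumptions
    last_seen: int   # time step of the most recent observation


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)    # name -> ObjectNode
    relations: set = field(default_factory=set)  # (subject, predicate, object)

    def update(self, step, detections, relations):
        """Merge one observation into the graph.

        Re-observed objects get fresh geometry, and any older relations
        in which they are the subject are replaced by the new ones.
        """
        seen = set()
        for name, position in detections:
            self.nodes[name] = ObjectNode(name, position, step)
            seen.add(name)
        self.relations = {r for r in self.relations if r[0] not in seen}
        self.relations |= set(relations)

    def blockers(self, target):
        """Return objects currently recorded as resting on the target."""
        return sorted(s for s, p, o in self.relations
                      if o == target and p == "on")


# Usage: a towel covers the box, then is moved aside by an interaction.
graph = SceneGraph()
graph.update(0,
             [("box", (0.4, 0.0, 0.1)), ("towel", (0.4, 0.0, 0.2))],
             [("towel", "on", "box")])
print(graph.blockers("box"))   # ['towel']

graph.update(1, [("towel", (0.1, 0.3, 0.0))], [("towel", "on", "table")])
print(graph.blockers("box"))   # []
```

In this sketch, a supervisor-like component could call `blockers` to choose between manipulating the target directly and first removing whatever covers it; how RoboRetriever actually makes that decision is not specified in the abstract.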
Similar Papers
GraspView: Active Perception Scoring and Best-View Optimization for Robotic Grasping in Cluttered Environments
Robotics
Robots grab things better using only pictures.
Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning
CV and Pattern Recognition
Lets computers understand 3D worlds like humans.
Attribute-based Object Grounding and Robot Grasp Detection with Spatial Reasoning
Robotics
Robots grab things when you just tell them what.