
RoboRetriever: Single-Camera Robot Object Retrieval via Active and Interactive Perception with Dynamic Scene Graph

Published: August 18, 2025 | arXiv ID: 2508.12916v1

By: Hecheng Wang, Jiankun Ren, Jia Yu and more

Potential Business Impact:

Robot finds things using one camera and words.

Humans effortlessly retrieve objects in cluttered, partially observable environments by combining visual reasoning, active viewpoint adjustment, and physical interaction, with only a single pair of eyes. In contrast, most existing robotic systems rely on carefully positioned fixed or multi-camera setups with complete scene visibility, which limits adaptability and incurs high hardware costs. We present RoboRetriever, a novel framework for real-world object retrieval that operates using only a single wrist-mounted RGB-D camera and free-form natural language instructions. RoboRetriever grounds visual observations to build and update a dynamic hierarchical scene graph that encodes object semantics, geometry, and inter-object relations over time. The supervisor module reasons over this memory and the task instruction to infer the target object and coordinate an integrated action module combining active perception, interactive perception, and manipulation. To enable task-aware, scene-grounded active perception, we introduce a novel visual prompting scheme that leverages large reasoning vision-language models to determine 6-DoF camera poses aligned with the semantic task goal and the geometric scene context. We evaluate RoboRetriever on diverse real-world object retrieval tasks, including scenarios with human intervention, demonstrating strong adaptability and robustness in cluttered scenes with only one RGB-D camera.
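To make the idea of a dynamic hierarchical scene graph more concrete, here is a minimal Python sketch of a memory that stores object semantics, rough geometry, parent/child hierarchy, and inter-object relations, and refreshes them as new observations arrive. It is an illustration only, not the RoboRetriever implementation: the class names (`ObjectNode`, `SceneGraph`), the fields, and the relation labels (`in_front_of`, `on_top_of`) are all assumptions that do not come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ObjectNode:
    """One object in the graph. All fields are hypothetical, for illustration."""
    name: str                              # semantic label, e.g. "mug"
    position: Tuple[float, float, float]   # rough 3D centroid (assumed metres)
    parent: Optional[str] = None           # containing or supporting object
    last_seen: int = 0                     # step of the latest observation


@dataclass
class SceneGraph:
    nodes: Dict[str, ObjectNode] = field(default_factory=dict)
    # Each relation is (subject, relation_label, object).
    relations: List[Tuple[str, str, str]] = field(default_factory=list)
    step: int = 0

    def observe(self, name, position, parent=None, relations=()):
        """Insert or refresh a node from a new camera observation."""
        self.step += 1
        self.nodes[name] = ObjectNode(name, position, parent, self.step)
        # Drop stale relations whose subject is this object, then add new ones.
        self.relations = [r for r in self.relations if r[0] != name]
        self.relations.extend((name, rel, other) for rel, other in relations)

    def children(self, name):
        """Names of objects whose parent is `name`, in sorted order."""
        return sorted(n.name for n in self.nodes.values() if n.parent == name)

    def occluders(self, target):
        """Objects recorded as being in front of or on top of the target."""
        return [s for s, rel, o in self.relations
                if o == target and rel in ("in_front_of", "on_top_of")]


graph = SceneGraph()
graph.observe("drawer", (0.40, 0.00, 0.20))
graph.observe("mug", (0.40, 0.00, 0.25), parent="drawer")
graph.observe("box", (0.40, -0.05, 0.25), parent="drawer",
              relations=[("in_front_of", "mug")])
print(graph.children("drawer"))  # ['box', 'mug']
print(graph.occluders("mug"))    # ['box']
```

In a sketch like this, a supervisor-style component could query `occluders` to decide whether the target is reachable directly or whether another object has to be moved first; how RoboRetriever actually makes that decision is not specified in the abstract.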

GraspView: Active Perception Scoring and Best-View Optimization for Robotic Grasping in Cluttered Environments

Robotics

Robots grab things better using only pictures.

6 Nov 2025 0

88%

Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning

CV and Pattern Recognition

Lets computers understand 3D worlds like humans.

8 Nov 2025 0

87%

Attribute-based Object Grounding and Robot Grasp Detection with Spatial Reasoning

Robotics

Robots grab things when you just tell them what.

9 Sep 2025 1


Page Count

9 pages


