Real-Time 3D Vision-Language Embedding Mapping
By: Christian Rauch, Björn Ellensohn, Linus Nwankwo, and more
Potential Business Impact:
Robots find objects described by voice.
A metric-accurate semantic 3D representation is essential for many robotic tasks. This work proposes a simple yet powerful way to integrate the 2D embeddings of a Vision-Language Model into a metric-accurate 3D representation in real time. We combine a local embedding-masking strategy, for a more distinct embedding distribution, with a confidence-weighted 3D integration for more reliable 3D embeddings. The resulting metric-accurate embedding representation is task-agnostic and can represent semantic concepts at a global, multi-room level as well as at a local, object level. This enables a variety of interactive robotic applications that require the localisation of objects of interest via natural language. We evaluate our approach on a variety of real-world sequences and demonstrate that these strategies achieve more accurate localisation of objects of interest while improving runtime performance to meet our real-time constraints. We further demonstrate the versatility of our approach in a variety of interactive handheld, mobile-robotics, and manipulation tasks, requiring only raw image data.
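To make the two core ideas concrete, below is a minimal sketch (not the authors' implementation) of how confidence-weighted integration of per-pixel VLM embeddings into a sparse voxel map, and natural-language querying of that map, might look. The class name `EmbeddingVoxelMap`, the running-average fusion rule, and the cosine-similarity query are assumptions for illustration; the paper's exact masking and integration scheme is not reproduced here.

```python
import numpy as np

class EmbeddingVoxelMap:
    """Hypothetical sketch: fuse 2D VLM embeddings into a sparse 3D voxel
    map with confidence weighting, then localise objects via text queries."""

    def __init__(self, voxel_size: float, dim: int):
        self.voxel_size = voxel_size
        self.dim = dim
        self.embeddings = {}  # voxel index -> running fused embedding
        self.weights = {}     # voxel index -> accumulated confidence

    def _key(self, point: np.ndarray) -> tuple:
        # Quantise a metric 3D point to its voxel index.
        return tuple(np.floor(point / self.voxel_size).astype(int))

    def integrate(self, point: np.ndarray, embedding: np.ndarray,
                  confidence: float) -> None:
        # Confidence-weighted running average: low-confidence observations
        # (e.g. mask borders, oblique or distant views) move the voxel's
        # embedding less than high-confidence ones.
        k = self._key(point)
        w = self.weights.get(k, 0.0)
        e = self.embeddings.get(k, np.zeros(self.dim))
        self.embeddings[k] = (w * e + confidence * embedding) / (w + confidence)
        self.weights[k] = w + confidence

    def query(self, text_embedding: np.ndarray, top_k: int = 5):
        # Rank voxels by cosine similarity to a natural-language query
        # embedded by a CLIP-style text encoder.
        keys = list(self.embeddings.keys())
        E = np.stack([self.embeddings[k] for k in keys])
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        t = text_embedding / np.linalg.norm(text_embedding)
        scores = E @ t
        order = np.argsort(scores)[::-1][:top_k]
        return [(keys[i], float(scores[i])) for i in order]
```

In use, each RGB-D frame would contribute back-projected 3D points with their masked 2D embeddings via `integrate`, and a spoken request such as "find the red mug" would be encoded to a text embedding and passed to `query` to retrieve the best-matching voxel locations.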
Similar Papers
Large Language Models and 3D Vision for Intelligent Robotic Perception and Autonomy: A Review
Robotics
Robots understand and act on spoken commands.
GeoVLA: Empowering 3D Representations in Vision-Language-Action Models
Robotics
Robots understand 3D space to do tasks better.