3D Aware Region Prompted Vision Language Model
By: An-Chieh Cheng, Yang Fu, Yukang Chen, and more
Potential Business Impact:
Lets computers understand 3D spaces from 2D pictures.
We present the Spatial Region 3D (SR-3D) aware vision-language model, which connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes or segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision-language and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness in unifying 2D and 3D representation spaces for scene understanding. Moreover, we observe applicability to in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations, where SR-3D accurately infers spatial relationships and metric measurements.
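To make the core idea concrete, the sketch below illustrates one plausible way to enrich 2D visual tokens with 3D positional embeddings, as the abstract describes. This is a minimal, hypothetical example, not the paper's actual implementation: the class name, MLP design, and the assumption that each patch token has an associated back-projected 3D point are all illustrative choices.

```python
import torch
import torch.nn as nn


class PosEnriched2DTokens(nn.Module):
    """Hypothetical sketch: add a learned 3D positional embedding to 2D visual tokens.

    Assumes each 2D patch token comes with a 3D point (e.g., from depth
    back-projection); shapes and module names are illustrative only.
    """

    def __init__(self, token_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Small MLP that maps an (x, y, z) coordinate to the token dimension.
        self.pos_mlp = nn.Sequential(
            nn.Linear(3, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, token_dim),
        )

    def forward(self, tokens: torch.Tensor, points_3d: torch.Tensor) -> torch.Tensor:
        # tokens:    (B, N, D) 2D visual tokens from an image encoder
        # points_3d: (B, N, 3) 3D coordinates associated with each token
        # The 3D positional embedding is simply added to the 2D features,
        # so downstream layers can reason about geometry across frames.
        return tokens + self.pos_mlp(points_3d)


if __name__ == "__main__":
    enrich = PosEnriched2DTokens(token_dim=1024)
    tokens = torch.randn(2, 196, 1024)   # e.g., 14x14 patch tokens per frame
    points = torch.randn(2, 196, 3)      # back-projected patch centers
    out = enrich(tokens, points)
    print(out.shape)  # torch.Size([2, 196, 1024])
```

Under this reading, tokens from different frames of the same scene share a metric 3D coordinate frame, which is what would let the model relate objects that never appear together in a single view.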
Similar Papers
Real-Time 3D Object Detection with Inference-Aligned Learning
CV and Pattern Recognition
Helps robots see and understand objects in 3D.
Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes
CV and Pattern Recognition
Helps robots understand 3D space from their own eyes.
Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation
CV and Pattern Recognition
Makes 3D pictures match words better.