HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model

Published: November 28, 2025 | arXiv ID: 2511.22961v1

By: Chen Li, Eric Peh, Basura Fernando

Potential Business Impact:

Helps computers understand 3D spaces from pictures and words.

Business Areas:
Image Recognition Data and Analytics, Software

Recent advances in large vision-language models (VLMs) have shown significant promise for 3D scene understanding. Existing VLM-based approaches typically align 3D scene features with the VLM's embedding space. However, this implicit alignment often yields suboptimal performance due to the scarcity of 3D data and the inherent complexity of spatial relationships in 3D environments. To address these limitations, we propose a novel hierarchical multimodal representation for 3D scene reasoning that explicitly aligns with VLMs in the input space by leveraging both multi-view images and text descriptions. The text descriptions capture spatial relationships by referencing the 3D coordinates of detected objects, while the multi-view images include a top-down perspective and four directional views (forward, left, right, and backward), ensuring comprehensive scene coverage. Additionally, we introduce a hierarchical feature representation that aggregates patch-level image features into view-level and scene-level representations, enabling the model to reason over both local and global scene context. Experimental results on both situated 3D question answering and general 3D question answering benchmarks demonstrate the effectiveness of our approach.
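
To make the two core ideas concrete, here is a minimal sketch (not the paper's code) of (a) turning detected objects into a coordinate-grounded text description and (b) pooling patch-level features into view-level and scene-level tokens. All shapes, the choice of mean pooling, and the helper names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the hierarchical multimodal representation described in the abstract.
# Assumptions: 5 rendered views (top-down, forward, left, right, backward),
# ViT-style patch features, and simple mean pooling at each level.
import torch


def objects_to_text(objects):
    """Hypothetical helper: detected objects -> spatially grounded description,
    e.g. [("chair", (1.2, 0.4, 0.0))] -> "chair at (1.2, 0.4, 0.0)"."""
    return "; ".join(
        f"{name} at ({x:.1f}, {y:.1f}, {z:.1f})" for name, (x, y, z) in objects
    )


def hierarchical_scene_tokens(patch_feats: torch.Tensor) -> torch.Tensor:
    """
    patch_feats: (num_views, num_patches, dim) patch-level image features.
    Returns a token sequence combining patch-, view-, and scene-level features
    that could be fed to a VLM alongside the text description.
    """
    num_views, num_patches, dim = patch_feats.shape
    view_feats = patch_feats.mean(dim=1)                # (num_views, dim) view-level
    scene_feat = view_feats.mean(dim=0, keepdim=True)   # (1, dim) scene-level
    # Keep local patch detail and append the coarser view/scene tokens so the
    # model can attend to both local and global scene context.
    return torch.cat([patch_feats.reshape(-1, dim), view_feats, scene_feat], dim=0)


# Example: 5 views, 256 patches per view, 768-dim features.
tokens = hierarchical_scene_tokens(torch.randn(5, 256, 768))
print(tokens.shape)  # torch.Size([1286, 768]) = 5*256 patch + 5 view + 1 scene tokens
print(objects_to_text([("chair", (1.2, 0.4, 0.0)), ("table", (2.0, 1.1, 0.0))]))
```

The key design point the sketch illustrates is that alignment happens in the VLM's input space: the scene is expressed as ordinary image tokens plus ordinary text, rather than as a separately learned 3D embedding projected into the VLM.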

Page Count
14 pages

Category
Computer Science:
Computer Vision and Pattern Recognition