SpatialLM: Training Large Language Models for Structured Indoor Modeling
By: Yongsen Mao, Junhao Zhong, Chuan Fang, and more
Potential Business Impact:
Lets computers understand 3D spaces like rooms.
SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements such as walls, doors, and windows, as well as oriented object boxes with their semantic categories. Unlike previous methods that rely on task-specific network designs, our model adheres to the standard multimodal LLM architecture and is fine-tuned directly from open-source LLMs. To train SpatialLM, we collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations, and conduct a careful study of various modeling and training decisions. On public benchmarks, our model achieves state-of-the-art performance in layout estimation and competitive results in 3D object detection. In doing so, we show a feasible path for enhancing the spatial understanding capabilities of modern LLMs for applications in augmented reality, embodied robotics, and more.
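To make the idea of "structured 3D scene understanding outputs" concrete, the sketch below shows one plausible way such outputs could be represented and parsed downstream. The abstract does not specify SpatialLM's actual output grammar, so the element names, fields, and line-based text format here are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical structured-scene types. The real SpatialLM output schema is
# not given in the abstract; these names and fields are assumptions.

@dataclass
class Wall:
    start: Tuple[float, float]   # (x, y) of one endpoint, metres
    end: Tuple[float, float]     # (x, y) of the other endpoint, metres
    height: float                # wall height, metres

@dataclass
class ObjectBox:
    category: str                        # semantic label, e.g. "sofa"
    center: Tuple[float, float, float]   # (x, y, z) box centre
    size: Tuple[float, float, float]     # (dx, dy, dz) box extents
    yaw: float                           # rotation about the vertical axis, radians

def parse_scene(text: str) -> Tuple[List[Wall], List[ObjectBox]]:
    """Parse a toy line-based scene description, one element per line,
    e.g. 'wall 0 0 4 0 2.6' or 'box sofa 1.2 0.8 0.4 2.0 0.9 0.8 1.57'."""
    walls: List[Wall] = []
    boxes: List[ObjectBox] = []
    for line in text.strip().splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "wall":
            x1, y1, x2, y2, h = map(float, tokens[1:6])
            walls.append(Wall((x1, y1), (x2, y2), h))
        elif tokens[0] == "box":
            cat = tokens[1]
            cx, cy, cz, dx, dy, dz, yaw = map(float, tokens[2:9])
            boxes.append(ObjectBox(cat, (cx, cy, cz), (dx, dy, dz), yaw))
    return walls, boxes

if __name__ == "__main__":
    # A tiny hand-written scene standing in for model output.
    demo = """
    wall 0 0 4 0 2.6
    wall 4 0 4 3 2.6
    box sofa 1.2 0.8 0.4 2.0 0.9 0.8 1.57
    """
    walls, boxes = parse_scene(demo)
    print(f"{len(walls)} walls, {len(boxes)} object boxes")
```

A text-serializable scene representation like this is one natural fit for an LLM-based pipeline, since the model can emit the scene as ordinary tokens and a lightweight parser can recover typed geometry for AR or robotics use.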
Similar Papers
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
CV and Pattern Recognition
Teaches computers to understand 3D space like humans.
How to Enable LLM with 3D Capacity? A Survey of Spatial Reasoning in LLM
CV and Pattern Recognition
Helps computers understand 3D worlds like we do.
Intelligent Spatial Perception by Building Hierarchical 3D Scene Graphs for Indoor Scenarios with the Help of LLMs
Robotics
Helps robots understand buildings to move around better.