Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding

Published: December 14, 2025 | arXiv ID: 2512.12822v1

By: Yongyuan Liang, Xiyao Wang, Yuanchen Ju and more

Potential Business Impact:

Helps computers understand 3D objects and scenes.

Business Areas:

Image Recognition Data and Analytics, Software

Scaling large multimodal models (LMMs) to 3D understanding poses unique challenges: point cloud data is sparse and irregular, existing models rely on fragmented architectures with modality-specific encoders, and training pipelines often suffer from instability and poor scalability. We introduce Lemon, a unified transformer architecture that addresses these challenges by jointly processing 3D point cloud patches and language tokens as a single sequence. Unlike prior work that relies on modality-specific encoders and cross-modal alignment modules, this design enables early spatial-linguistic fusion, eliminates redundant encoders, improves parameter efficiency, and supports more effective model scaling. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context, and a three-stage training curriculum that progressively builds capabilities from object-level recognition to scene-level spatial reasoning. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks, from object recognition and captioning to spatial reasoning in 3D scenes, while demonstrating robust scaling properties as model size and training data increase. Our work provides a unified foundation for advancing 3D spatial intelligence in real-world applications.
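To make the idea of processing point cloud patches and language tokens as a single sequence more concrete, here is a minimal sketch in plain Python. It is not Lemon's actual patchification or tokenization scheme, which the abstract does not specify. Everything in it is an assumption made for illustration: the voxel-grid grouping, the function names `patchify`, `patch_feature`, and `build_sequence`, the `grid` parameter, and the toy centroid-plus-count feature that stands in for a learned patch embedding. Coordinates are assumed to be normalized to the unit cube.

```python
from collections import defaultdict


def patchify(points, grid=2):
    """Group points in the unit cube into up to grid**3 voxel patches.

    Hypothetical helper: returns (cell, points_in_cell) pairs sorted by cell
    index, so the patch order is deterministic and tied to spatial position.
    """
    cells = defaultdict(list)
    for x, y, z in points:
        # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
        cell = tuple(min(int(c * grid), grid - 1) for c in (x, y, z))
        cells[cell].append((x, y, z))
    return sorted(cells.items())


def patch_feature(pts):
    """Toy patch embedding: centroid plus point count.

    Assumption: a real model would use a learned projection here.
    """
    n = len(pts)
    centroid = [sum(p[i] for p in pts) / n for i in range(3)]
    return centroid + [float(n)]


def build_sequence(points, text_tokens, grid=2):
    """Concatenate point-cloud patch tokens and text tokens into one sequence."""
    seq = [("pc", cell, patch_feature(pts)) for cell, pts in patchify(points, grid)]
    seq += [("txt", tok) for tok in text_tokens]
    return seq


points = [(0.1, 0.1, 0.1), (0.2, 0.1, 0.3), (0.9, 0.8, 0.7)]
sequence = build_sequence(points, ["where", "is", "the", "chair", "?"])
print(len(sequence))  # 7: 2 patch tokens + 5 text tokens
```

In this sketch each patch token keeps its cell index, so a single transformer consuming the sequence could see where each patch sits in space without a separate 3D encoder. That mirrors the early-fusion idea described in the abstract only loosely.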

Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding

CV and Pattern Recognition

Helps computers understand 3D spaces better.

2 Dec 2025 0

89%

Part-X-MLLM: Part-aware 3D Multimodal Large Language Model

CV and Pattern Recognition

Lets computers build and change 3D objects with words.

17 Nov 2025 1

89%

Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation

CV and Pattern Recognition

Makes 3D pictures match words better.

18 Nov 2025 1

Page Count

26 pages
