RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing
By: Wentang Chen, Shougao Zhang, Yiman Zhang, and more
Generating controllable and interactive indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI training. Yet existing approaches either handle only a narrow range of input modalities or rely on stochastic processes that hinder controllability. To overcome these limitations, we introduce RoomPilot, a unified framework that parses diverse multimodal inputs (textual descriptions or CAD floor plans) into an Indoor Domain-Specific Language (IDSL) for structured indoor scene generation. The key insight is that a well-designed IDSL can act as a shared semantic representation, enabling coherent, high-quality scene synthesis from any single modality while preserving interaction semantics. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a curated dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors. Extensive experiments further validate its strong multimodal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity, marking a significant step toward general-purpose, controllable 3D indoor scene generation.
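The abstract does not reproduce the IDSL grammar, but a minimal sketch can illustrate the core idea of a shared semantic representation into which every input modality is parsed. The structures below (`Room`, `Asset`, `Interaction`, and their fields) are hypothetical illustrations chosen for this sketch, not the authors' actual IDSL.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an "Indoor DSL" intermediate representation.
# All names and fields here are illustrative assumptions; the actual
# IDSL grammar is defined in the RoomPilot paper, not reproduced here.

@dataclass
class Interaction:
    """An interaction affordance attached to an asset (e.g., a door that opens)."""
    kind: str                              # e.g., "openable", "switchable", "sittable"
    params: dict = field(default_factory=dict)

@dataclass
class Asset:
    """A placed object with pose and interaction semantics."""
    category: str                          # e.g., "sofa", "cabinet"
    position: tuple                        # (x, y, z) in room coordinates
    rotation_deg: float = 0.0
    interactions: list = field(default_factory=list)

@dataclass
class Room:
    """One room parsed from text or a CAD floor plan into the shared IDSL."""
    name: str
    footprint: list                        # 2D polygon of the floor plan, [(x, y), ...]
    assets: list = field(default_factory=list)

# Either modality (a text description or a CAD floor plan) would be parsed
# into the same structure, from which the scene synthesizer can produce a
# fully interactive 3D environment.
living_room = Room(
    name="living_room",
    footprint=[(0, 0), (5, 0), (5, 4), (0, 4)],
    assets=[
        Asset(
            category="cabinet",
            position=(4.2, 0.3, 0.0),
            interactions=[Interaction(kind="openable", params={"hinge": "left"})],
        )
    ],
)
```

In this reading, controllability comes from the representation itself: because both text and floor plans resolve to the same explicit structure, a user can edit any field directly rather than re-sampling a stochastic generator, and interaction annotations travel with each asset into the synthesized scene.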