Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation
By: Liu He, Xiao Zeng, Yizhi Song, and more
Potential Business Impact:
Teaches computers to understand where the camera and objects are positioned in pictures.
Multimodal Large Language Models (MLLMs) struggle with accurately capturing camera-object relations, especially object orientation, camera viewpoint, and camera shots. This stems from the fact that existing MLLMs are trained on images with limited diversity in camera-object relations and corresponding textual descriptions. To address this, we propose a synthetic generation pipeline to create large-scale 3D visual instruction datasets. Our framework takes 3D assets as input and uses rendering and diffusion-based image generation models to create photorealistic images that preserve precise camera-object relations. Additionally, large language models (LLMs) are used to generate text prompts for guiding visual instruction tuning and controlling image generation. We create Ultimate3D, a dataset of 240K VQAs with precise camera-object annotations, along with a corresponding benchmark. MLLMs fine-tuned on our proposed dataset outperform commercial models by a large margin, achieving an average accuracy improvement of 33.4% on camera-object relation recognition tasks. Our code, dataset, and benchmark will contribute to a broad range of MLLM applications.
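The abstract describes a pipeline that renders 3D assets under known camera poses and turns the pose metadata into VQA-style instruction data. Since the paper's code and dataset are not yet released, the sketch below is a minimal, purely illustrative Python example of how one VQA record with camera-object annotations might be assembled after rendering; the field names, bin boundaries, shot thresholds, and the make_vqa_record helper are all assumptions for illustration, not the authors' actual implementation.

```python
import json
import random

def azimuth_to_viewpoint(azimuth_deg: float) -> str:
    """Map a sampled azimuth (degrees) to a coarse viewpoint label.
    The bin boundaries are illustrative, not the paper's scheme."""
    a = azimuth_deg % 360
    if a < 45 or a >= 315:
        return "front"
    if a < 135:
        return "left side"
    if a < 225:
        return "back"
    return "right side"

def make_vqa_record(asset_id: str, image_path: str) -> dict:
    """Assemble one VQA entry pairing an image with precise camera-object annotations."""
    azimuth = random.uniform(0, 360)      # camera rotation around the object
    elevation = random.uniform(-10, 60)   # camera height angle
    distance = random.uniform(1.5, 4.0)   # camera-object distance (proxy for shot scale)

    viewpoint = azimuth_to_viewpoint(azimuth)
    shot = ("close-up" if distance < 2.0
            else "medium shot" if distance < 3.0
            else "long shot")

    return {
        "asset_id": asset_id,
        "image": image_path,
        "camera": {
            "azimuth": round(azimuth, 1),
            "elevation": round(elevation, 1),
            "distance": round(distance, 2),
        },
        "question": "From which side is the camera viewing the object, and what kind of shot is this?",
        "answer": f"The camera views the object from the {viewpoint} in a {shot}.",
    }

if __name__ == "__main__":
    # Hypothetical asset ID and render path, for demonstration only.
    record = make_vqa_record("chair_0042", "renders/chair_0042_000.png")
    print(json.dumps(record, indent=2))
```

Because the camera parameters are sampled before rendering rather than estimated afterwards, the annotations stay exact by construction, which is the property the abstract emphasizes over relying on images scraped with unknown camera-object relations.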
Similar Papers
Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis
CV and Pattern Recognition
Makes AI create realistic pictures and videos automatically.
3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding
CV and Pattern Recognition
Helps computers understand 3D spaces like we do.
3DFroMLLM: 3D Prototype Generation only from Pretrained Multimodal LLMs
CV and Pattern Recognition
Makes computers build 3D shapes from words.