Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis

Published: August 7, 2025 | arXiv ID: 2508.05580v1

By: Kunyu Feng, Yue Ma, Xinhua Zhang, and more

Potential Business Impact:

Makes AI create realistic pictures and videos automatically.

With the growing demand for AI-generated content (AIGC), the need for high-quality, diverse, and scalable data has become increasingly crucial. However, collecting large-scale real-world data remains costly and time-consuming, which hinders the development of downstream applications. While some works attempt to collect task-specific data via a rendering process, most approaches still rely on manual scene construction, which limits their scalability and accuracy. To address these challenges, we propose Follow-Your-Instruction, a Multimodal Large Language Model (MLLM)-driven framework for automatically synthesizing high-quality 2D, 3D, and 4D data. Follow-Your-Instruction first collects assets and their associated descriptions from multimodal inputs using the MLLM-Collector. It then constructs 3D layouts with the MLLM-Generator and uses the MLLM-Optimizer, which leverages Vision-Language Models (VLMs), to refine them semantically across multi-view scenes. Finally, it uses the MLLM-Planner to generate temporally coherent future frames. We evaluate the quality of the generated data through comprehensive experiments on 2D, 3D, and 4D generative tasks. The results show that our synthetic data significantly boosts the performance of existing baseline models, demonstrating Follow-Your-Instruction's potential as a scalable and effective data engine for generative intelligence.
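The four stages named in the abstract (Collector, Generator, Optimizer, Planner) run in a fixed order, each consuming the previous stage's output. The following is a minimal sketch of that ordering only, not the authors' implementation. The `Scene` class, the four function names, and the toy logic inside each stage are all hypothetical stand-ins, and no MLLM or VLM is actually called.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Position = Tuple[float, float, float]


@dataclass
class Scene:
    """Hypothetical container passed between stages (not from the paper)."""
    instruction: str
    assets: List[str] = field(default_factory=list)
    layout: Dict[str, Position] = field(default_factory=dict)
    refined: bool = False
    frames: List[Dict[str, Position]] = field(default_factory=list)


def collect(scene: Scene) -> Scene:
    # Stand-in for the MLLM-Collector: treat each word as an asset name.
    scene.assets = scene.instruction.split()
    return scene


def generate(scene: Scene) -> Scene:
    # Stand-in for the MLLM-Generator: place assets one unit apart along x.
    scene.layout = {
        name: (float(i), 0.0, 0.0) for i, name in enumerate(scene.assets)
    }
    return scene


def optimize(scene: Scene) -> Scene:
    # Stand-in for the MLLM-Optimizer: the real stage refines the layout
    # with VLM feedback on multi-view renders; here we only set a flag.
    scene.refined = True
    return scene


def plan(scene: Scene, n_frames: int) -> Scene:
    # Stand-in for the MLLM-Planner: move every asset along z by a fixed
    # step per frame to mimic a temporally ordered sequence.
    scene.frames = [
        {name: (x, y, z + 0.1 * t) for name, (x, y, z) in scene.layout.items()}
        for t in range(n_frames)
    ]
    return scene


def run_pipeline(instruction: str, n_frames: int = 4) -> Scene:
    # Assumed stage order, following the abstract:
    # collect -> generate -> optimize -> plan.
    scene = Scene(instruction)
    scene = collect(scene)
    scene = generate(scene)
    scene = optimize(scene)
    scene = plan(scene, n_frames)
    return scene
```

Calling `run_pipeline("cup table", n_frames=3)` returns a `Scene` with two assets, a layout entry for each, and three frames. In a real system each stand-in would be replaced by a model call, while the data flow between stages would stay the same.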

Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation

Graphics

Teaches computers to understand pictures better.

11 Jul 2025 1

89%

3DFroMLLM: 3D Prototype Generation only from Pretrained Multimodal LLMs

CV and Pattern Recognition

Makes computers build 3D shapes from words.

12 Aug 2025 1

89%

Text2VR: Automated instruction Generation in Virtual Reality using Large language Models for Assembly Task

CV and Pattern Recognition

Makes VR training lessons automatically from text.

19 Jul 2025 0

Page Count

11 pages
