Score: 2

E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources

Published: October 31, 2025 | arXiv ID: 2510.27135v1

By: Tong Shen, Jingai Yu, Dong Zhou and more

BigTech Affiliations: AMD

Potential Business Impact:

Makes AI create pictures faster with less power.

Business Areas:

Digital Media, Media and Entertainment

Diffusion models have shown strong capabilities in generating high-quality images from text prompts. However, these models often require large-scale training data and significant computational resources to train, or rely on heavy architectures with high latency. To this end, we propose Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis with low training resource requirements. We provide an easily reproducible baseline with competitive results. Our model for 512px generation, trained with only 25M public data in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and reaches 0.72 with post-training techniques such as GRPO. Our design philosophy centers on token reduction, since the computational cost scales significantly with the token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module for further compression of tokens. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost. In addition, we propose AdaLN-affine, an efficient lightweight module for computing modulation parameters in transformer blocks. Our code is available at https://github.com/AMD-AGI/Nitro-E, and we hope E-MMDiT serves as a strong and practical baseline for future research and contributes to the democratization of generative AI models.
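To make the token-reduction argument concrete, the following is a rough back-of-the-envelope sketch, not the authors' implementation. It counts query-key pairs in self-attention, which grow quadratically with the number of tokens, and shows how restricting attention to equal-sized subregions divides that count. The downsampling factors (8 and 32), the group count (4), and the helper names `token_count` and `attention_pairs` are illustrative assumptions and are not taken from the paper; text tokens, MLP cost, and the multi-path compression module are ignored.

```python
def token_count(image_size: int, downsample: int) -> int:
    """Number of visual tokens for a square image, assuming one token
    per latent position after spatial downsampling (hypothetical setup)."""
    side = image_size // downsample
    return side * side


def attention_pairs(tokens: int, groups: int = 1) -> int:
    """Query-key pairs when tokens are split into `groups` equal subregions
    and attention is computed only within each subregion."""
    if tokens % groups != 0:
        raise ValueError("tokens must divide evenly into groups")
    per_group = tokens // groups
    return groups * per_group * per_group


if __name__ == "__main__":
    # Compare an assumed 8x tokenizer with an assumed 32x tokenizer at 512px.
    for factor in (8, 32):
        n = token_count(512, factor)
        full = attention_pairs(n)
        sub = attention_pairs(n, groups=4)
        print(f"downsample={factor}: tokens={n}, full={full}, 4 subregions={sub}")
```

Under these assumptions, moving from 4096 to 256 tokens cuts the pair count by a factor of 256, and splitting 256 tokens into 4 subregions cuts it by a further factor of 4.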

Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing

CV and Pattern Recognition

Changes pictures using words, better than before.

11 Aug 2025 1

91%

Unraveling MMDiT Blocks: Training-free Analysis and Enhancement of Text-conditioned Diffusion

CV and Pattern Recognition

Makes AI pictures match words better.

5 Jan 2026 0

90%

EDiT: Efficient Diffusion Transformers with Linear Compressed Attention

CV and Pattern Recognition

Makes AI create better pictures faster.

20 Mar 2025 2


Country of Origin

🇺🇸 United States

Repos / Data Links

github.com

Page Count

11 pages
