Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing
By: Joonghyuk Shin, Alchan Hwang, Yujin Kim, and more
Potential Business Impact:
Changes pictures using words, better than before.
Transformer-based diffusion models have recently superseded traditional U-Net architectures, with multimodal diffusion transformers (MM-DiT) emerging as the dominant approach in state-of-the-art models such as Stable Diffusion 3 and FLUX.1. Previous U-Net-based approaches relied on unidirectional cross-attention, with information flowing only from text embeddings to image latents. In contrast, MM-DiT introduces a unified attention mechanism that concatenates the input projections of both modalities and performs a single full attention operation, allowing bidirectional information flow between the text and image branches. This architectural shift presents significant challenges for existing editing techniques. In this paper, we systematically analyze MM-DiT's attention mechanism by decomposing its attention matrices into four distinct blocks, revealing their inherent characteristics. Building on these analyses, we propose a robust, prompt-based image editing method for MM-DiT that supports edits ranging from global to local across various MM-DiT variants, including few-step models. We believe our findings bridge the gap between existing U-Net-based methods and emerging architectures, offering deeper insights into MM-DiT's behavioral patterns.
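To make the described attention mechanism concrete, the sketch below illustrates the general idea: text and image tokens are concatenated into one sequence, a single full attention pass runs over the joint sequence (so information flows in both directions), and the resulting attention matrix can be sliced into the four blocks the abstract refers to (text-to-text, text-to-image, image-to-text, image-to-image). This is a minimal, single-head illustration under simplifying assumptions, not the authors' implementation; the function name `mmdit_joint_attention` and the shared projection weights are hypothetical, and real MM-DiT blocks use separate per-branch projections, multiple heads, and timestep modulation.

```python
import torch

def mmdit_joint_attention(text_tokens, image_tokens, w_q, w_k, w_v):
    """Single-head sketch of MM-DiT-style unified attention (illustrative only).

    text_tokens:  (N_t, d) text-branch hidden states
    image_tokens: (N_i, d) image-branch hidden states (flattened latent patches)
    w_q, w_k, w_v: (d, d)  shared projection weights (a simplification; real
                           models project each branch separately)
    """
    n_t = text_tokens.shape[0]

    # Concatenate both modalities into one token sequence.
    x = torch.cat([text_tokens, image_tokens], dim=0)        # (N_t + N_i, d)
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # One full attention over the joint sequence -> bidirectional flow
    # between the text and image branches.
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)                 # (N, N)
    attn = scores.softmax(dim=-1)
    out = attn @ v

    # Slice the joint attention matrix into the four blocks analyzed in
    # the paper: text->text, text->image, image->text, image->image.
    blocks = {
        "t2t": attn[:n_t, :n_t],
        "t2i": attn[:n_t, n_t:],
        "i2t": attn[n_t:, :n_t],
        "i2i": attn[n_t:, n_t:],
    }
    return out[:n_t], out[n_t:], blocks
```

In this toy setup, inspecting the four blocks separately is what distinguishes analyzing MM-DiT from analyzing a U-Net cross-attention layer, which would only expose the equivalent of the image-to-text block.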
Similar Papers
EDiT: Efficient Diffusion Transformers with Linear Compressed Attention
CV and Pattern Recognition
Makes AI create better pictures faster.
E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources
CV and Pattern Recognition
Makes AI create pictures faster with less power.
Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach
CV and Pattern Recognition
Combines pictures using words to make better images.