Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing
By: Joonghyuk Shin, Alchan Hwang, Yujin Kim, and more
Potential Business Impact:
Changes pictures using words, better than before.
Transformer-based diffusion models have recently superseded traditional U-Net architectures, with multimodal diffusion transformers (MM-DiT) emerging as the dominant approach in state-of-the-art models such as Stable Diffusion 3 and FLUX.1. Previous U-Net-based approaches relied on unidirectional cross-attention, with information flowing only from text embeddings to image latents. In contrast, MM-DiT introduces a unified attention mechanism that concatenates the input projections of both modalities and performs a single full attention operation, allowing bidirectional information flow between the text and image branches. This architectural shift presents significant challenges for existing editing techniques. In this paper, we systematically analyze MM-DiT's attention mechanism by decomposing its attention matrices into four distinct blocks, revealing their inherent characteristics. Building on these analyses, we propose a robust, prompt-based image editing method for MM-DiT that supports edits ranging from global to local across various MM-DiT variants, including few-step models. We believe our findings bridge the gap between existing U-Net-based methods and emerging architectures, offering deeper insights into MM-DiT's behavioral patterns.
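To make the described attention mechanism concrete, the sketch below illustrates the general idea: text and image tokens are concatenated into one sequence, a single full attention pass runs over the joint sequence (so information flows in both directions), and the resulting attention matrix can be sliced into the four blocks the abstract refers to (text-to-text, text-to-image, image-to-text, image-to-image). This is a minimal, single-head illustration under simplifying assumptions, not the authors' implementation; the function name `mmdit_joint_attention` and the shared projection weights are hypothetical, and real MM-DiT blocks use separate per-branch projections, multiple heads, and timestep modulation.

```python
import torch

def mmdit_joint_attention(text_tokens, image_tokens, w_q, w_k, w_v):
    """Single-head sketch of MM-DiT-style unified attention (illustrative only).

    text_tokens:  (N_t, d) text-branch hidden states
    image_tokens: (N_i, d) image-branch hidden states (flattened latent patches)
    w_q, w_k, w_v: (d, d)  shared projection weights (a simplification; real
                           models project each branch separately)
    """
    n_t = text_tokens.shape[0]

    # Concatenate both modalities into one token sequence.
    x = torch.cat([text_tokens, image_tokens], dim=0)        # (N_t + N_i, d)
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # One full attention over the joint sequence -> bidirectional flow
    # between the text and image branches.
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)                 # (N, N)
    attn = scores.softmax(dim=-1)
    out = attn @ v

    # Slice the joint attention matrix into the four blocks analyzed in
    # the paper: text->text, text->image, image->text, image->image.
    blocks = {
        "t2t": attn[:n_t, :n_t],
        "t2i": attn[:n_t, n_t:],
        "i2t": attn[n_t:, :n_t],
        "i2i": attn[n_t:, n_t:],
    }
    return out[:n_t], out[n_t:], blocks
```

In this toy setup, inspecting the four blocks separately is what distinguishes analyzing MM-DiT from analyzing a U-Net cross-attention layer, which would only expose the equivalent of the image-to-text block.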
Similar Papers
EDiT: Efficient Diffusion Transformers with Linear Compressed Attention
CV and Pattern Recognition
Makes AI create better pictures faster.
E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources
CV and Pattern Recognition
Makes AI create pictures faster with less power.
Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach
CV and Pattern Recognition
Combines pictures using words to make better images.