Score: 3

EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing

Published: December 12, 2025 | arXiv ID: 2512.11715v1

By: Wei Chow, Linfeng Li, Lingdong Kong and more

Potential Business Impact:

Edits pictures without messing up other parts.

Business Areas:

Photo Editing, Content and Publishing, Media and Entertainment

Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we look beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to this challenge. Because MGTs predict multiple masked tokens rather than refining the whole image at once, they exhibit a localized decoding paradigm that gives them an inherent capacity to explicitly preserve non-relevant regions during editing. Building on this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that the MGT's cross-attention maps provide informative signals for localizing edit-relevant regions, and we devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained, precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of the surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves similarity performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.
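
The abstract describes two steps in words: consolidating cross-attention maps from several layers into one localization map, and holding tokens in low-attention regions fixed during sampling. The following is a minimal illustrative sketch of that idea, not the authors' implementation. It assumes the consolidation is a plain average followed by min-max normalization and that the hold decision is a fixed threshold; the function names `consolidate_attention` and `region_hold`, the `threshold` parameter, and the toy token values are all hypothetical.

```python
import numpy as np


def consolidate_attention(attn_maps: np.ndarray) -> np.ndarray:
    """Merge per-layer maps of shape (layers, tokens) into one score in [0, 1].

    Assumption: simple averaging plus min-max normalization stands in for
    the paper's multi-layer consolidation scheme, which the abstract does
    not specify.
    """
    mean_map = attn_maps.mean(axis=0)
    span = mean_map.max() - mean_map.min()
    if span == 0:
        # A flat map carries no localization signal; mark nothing editable.
        return np.zeros_like(mean_map)
    return (mean_map - mean_map.min()) / span


def region_hold(source_tokens, predicted_tokens, attn_score, threshold=0.5):
    """Keep source tokens in low-attention areas; accept predictions elsewhere.

    Assumption: a fixed threshold decides which tokens may flip.
    """
    editable = attn_score >= threshold
    held = np.where(editable, predicted_tokens, source_tokens)
    return held, editable


# Toy example: 2 layers, 4 image tokens; the last two attend to the edit text.
attn_maps = np.array([[0.1, 0.2, 0.9, 0.8],
                      [0.0, 0.1, 0.7, 1.0]])
source = np.array([11, 12, 13, 14])      # token ids of the input image
predicted = np.array([21, 22, 23, 24])   # token ids proposed by the model

score = consolidate_attention(attn_maps)
edited, editable = region_hold(source, predicted, score)
print(edited.tolist())  # [11, 12, 23, 24]
```

In this toy run only the two high-attention positions change, while the first two tokens keep their source values, which is the behavior the abstract attributes to region-hold sampling.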

Multivariate Diffusion Transformer with Decoupled Attention for High-Fidelity Mask-Text Collaborative Facial Generation

CV and Pattern Recognition

Creates realistic faces from masks and words.

16 Nov 2025 0

89%

GPTFace: Generative Pre-training of Facial-Linguistic Transformer by Span Masking and Weakly Correlated Text-image Data

CV and Pattern Recognition

Teaches computers to understand and change faces.

21 Oct 2025 1

88%

Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing

CV and Pattern Recognition

Changes pictures using words, better than before.

11 Aug 2025 1


Repos / Data Links

github.com github.com huggingface.co

Page Count

47 pages
