Revealing the Attention Floating Mechanism in Masked Diffusion Models
By: Xin Dai, Pengcheng Huang, Zhenghao Liu, and more
Masked diffusion models (MDMs), which leverage bidirectional attention and a denoising process, are narrowing the performance gap with autoregressive models (ARMs). However, their internal attention mechanisms remain under-explored. This paper investigates attention behaviors in MDMs and reveals the phenomenon of Attention Floating. Unlike ARMs, where attention converges to a fixed sink, MDMs exhibit dynamic, dispersed attention anchors that shift across denoising steps and layers. Further analysis reveals a Shallow Structure-Aware, Deep Content-Focused attention mechanism: shallow layers utilize floating tokens to build a global structural framework, while deeper layers allocate more capacity to capturing semantic content. Empirically, this distinctive attention pattern provides a mechanistic explanation for the strong in-context learning capabilities of MDMs, allowing them to double the performance of ARMs on knowledge-intensive tasks. All code and datasets are available at https://github.com/NEUIR/Attention-Floating.
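To make the "floating" behavior concrete, below is a minimal sketch of how one might track attention anchors across denoising steps and layers. It assumes per-step attention weights have already been collected (e.g. via forward hooks) into tensors of shape [num_layers, num_heads, seq_len, seq_len]; the function name, tensor layout, and hook-based collection are assumptions for illustration, not the paper's released implementation.

```python
import torch

def track_attention_anchors(attn_maps):
    """
    Track the most-attended key token ("anchor") per layer across denoising steps.

    attn_maps: dict mapping denoising step t -> tensor of shape
               [num_layers, num_heads, seq_len, seq_len], where rows are
               queries and columns are keys (assumed layout).

    Returns: dict mapping step t -> list of anchor token indices, one per layer.
    """
    anchors = {}
    for t, attn in attn_maps.items():
        # Average over heads, then over queries, to get the total attention
        # mass each key position receives in every layer: [num_layers, seq_len].
        incoming_mass = attn.mean(dim=1).mean(dim=1)
        # The layer's anchor at this step is the key receiving the most mass.
        anchors[t] = incoming_mass.argmax(dim=-1).tolist()
    return anchors
```

Under this sketch, a fixed attention sink (as reported for ARMs) would show the same anchor index at every step, whereas attention floating would show the argmax drifting across denoising steps and differing between shallow and deep layers.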