SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding

Published: December 16, 2025 | arXiv ID: 2512.14068v1

By: Shuang Cheng, Yuhua Jiang, Zineng Zhou and more

Potential Business Impact:

Makes computers understand pictures and words faster.

Business Areas:

Image Recognition Data and Analytics, Software

Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an integrated framework for efficient and stable training. This framework unifies three components: (1) Asynchronous Block-wise Noise Scheduling to diversify supervision within each batch; (2) Effective Mask Ratio Scaling for unbiased loss normalization under stochastic masking; and (3) a Progressive Beta Noise Curriculum that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves training efficiency, convergence stability, and task performance over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.
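
The abstract names the three training components but does not give their formulas, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each block draws its own mask ratio from a Beta distribution, that the Beta mean rises with training progress, and that the loss is normalized by the realized fraction of masked tokens rather than by the sampled ratio. Every function name, parameter, and default value here (beta_params, sample_block_masks, scaled_block_loss, start_mean, end_mean, concentration) is hypothetical.

```python
import random


def beta_params(progress, start_mean=0.5, end_mean=0.8, concentration=4.0):
    """Hypothetical curriculum: move the Beta mean from start_mean to end_mean.

    progress is the fraction of training completed, in [0, 1]. A finite
    concentration keeps the sampled ratios spread out instead of collapsing
    to one value, while the average mask ratio rises over training.
    """
    mean = start_mean + (end_mean - start_mean) * progress
    return mean * concentration, (1.0 - mean) * concentration


def sample_block_masks(num_blocks, block_size, progress, rng):
    """Draw an independent mask ratio for every block, then mask tokens.

    Returns a list of (sampled_ratio, mask) pairs, where mask is a list of
    booleans and True marks a masked token. Sampling per block (assumption)
    means blocks in the same sequence see different noise levels.
    """
    alpha, beta = beta_params(progress)
    blocks = []
    for _ in range(num_blocks):
        ratio = rng.betavariate(alpha, beta)
        mask = [rng.random() < ratio for _ in range(block_size)]
        blocks.append((ratio, mask))
    return blocks


def scaled_block_loss(token_losses, mask):
    """Normalize the masked-token loss by the realized mask fraction.

    Assumption: "effective mask ratio" means the fraction of tokens that
    were actually masked in this block, which can differ from the sampled
    ratio because masking is stochastic.
    """
    num_masked = sum(mask)
    if num_masked == 0:
        return 0.0  # nothing was masked, so this block gives no supervision
    effective_ratio = num_masked / len(mask)
    masked_sum = sum(loss for loss, m in zip(token_losses, mask) if m)
    return masked_sum / (len(mask) * effective_ratio)
```

Under this reading, dividing by the realized fraction makes the block loss equal to the mean loss over the tokens that were actually masked, so blocks with few or many masked tokens contribute on a comparable scale.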

SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation

Machine Learning (CS)

Makes AI think faster and better.

7 Oct 2025 2

89%

From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs

Computation and Language

Makes AI write faster by using new tricks.

7 Dec 2025 2

89%

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

CV and Pattern Recognition

Teaches robots to do tasks by watching and listening.

27 Aug 2025 2

Page Count

19 pages
