Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding
By: Bowen Sun, Yujun Cai, Ming-Hsuan Yang, and more
Potential Business Impact:
Teaches AI to write better by training it block by block.
Discrete diffusion language models have shown strong potential for text generation, yet standard supervised fine-tuning (SFT) misaligns with their semi-autoregressive inference: training randomly masks tokens across the entire response, while inference generates fixed-size blocks sequentially. This mismatch introduces noisy prefixes and leaky suffixes, biasing gradients away from the desired blockwise likelihood. We propose Blockwise SFT, which partitions responses into fixed-size blocks, selects one active block per step for stochastic masking, freezes all preceding tokens, and fully hides future ones. Loss is computed only over the active block, directly mirroring the blockwise decoding process. Experiments on GSM8K, MATH, and MetaMathQA show consistent gains over classical SFT under equal compute or token budgets. Block size consistency studies and ablations confirm that improvements stem from faithful training-inference alignment rather than incidental masking effects. Our results highlight the importance of matching supervision granularity to the decoding procedure in diffusion-based language models.
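To make the training step concrete, here is a minimal Python/PyTorch sketch of how a single Blockwise SFT example might be constructed from the abstract's description: partition the response into fixed-size blocks, pick one active block, keep the prefix clean, fully hide future blocks, and stochastically mask only the active block. The function name, MASK_ID, and the choice to "hide" future blocks by replacing them with mask tokens are illustrative assumptions, not details taken from the paper.

```python
import torch

MASK_ID = 32000  # hypothetical mask-token id (illustrative, not from the paper)

def make_blockwise_sft_example(response_ids, block_size, generator=None):
    """Build one Blockwise SFT training example (sketch).

    - Partition the response into fixed-size blocks.
    - Sample one active block; keep all earlier tokens clean (frozen prefix).
    - Fully hide all later blocks by replacing them with mask tokens.
    - Stochastically mask tokens inside the active block only.
    - Return the corrupted sequence and a loss mask over the active block's
      masked positions (the only positions that receive supervision).
    """
    T = response_ids.size(0)
    num_blocks = (T + block_size - 1) // block_size

    # Choose the active block uniformly at random.
    b = int(torch.randint(num_blocks, (1,), generator=generator))
    start, end = b * block_size, min((b + 1) * block_size, T)

    # Diffusion-style masking: sample a masking rate, then mask tokens in the
    # active block independently with that probability.
    rate = float(torch.rand(1, generator=generator))
    masked = torch.rand(end - start, generator=generator) < rate
    masked_idx = torch.arange(start, end)[masked]

    noisy = response_ids.clone()
    noisy[masked_idx] = MASK_ID   # stochastic masking inside the active block
    noisy[end:] = MASK_ID         # future blocks are fully hidden

    loss_mask = torch.zeros(T, dtype=torch.bool)
    loss_mask[masked_idx] = True  # loss only on the active block's masked tokens
    return noisy, loss_mask

# Usage: a 10-token response with block size 4.
ids = torch.arange(100, 110)
noisy, loss_mask = make_blockwise_sft_example(ids, block_size=4)
print(noisy)
print(loss_mask)
```

This mirrors the decoding loop described in the abstract: at inference time the model fills in one block at a time with earlier blocks fixed and later blocks unknown, so training on exactly that conditioning pattern avoids the noisy-prefix and leaky-suffix mismatch of standard SFT.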
Similar Papers
WeFT: Weighted Entropy-driven Fine-Tuning for dLLMs
Computation and Language
Makes AI better at solving puzzles and math.
Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
Computation and Language
Makes AI write better and faster.
Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting
Computation and Language
Keeps AI smart while teaching it new tricks.