DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

Published: December 17, 2025 | arXiv ID: 2512.15713v1

By: Lunbin Zeng, Jingfeng Yao, Bencheng Liao and more

Potential Business Impact:

Makes AI understand pictures and text faster.

Business Areas:

Autonomous Vehicles, Transportation

In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive (AR) paradigm, owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of diffusion vision language models (dVLMs) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that can be translated from any powerful AR model. Through simple fine-tuning, we successfully adapt AR pre-trained models into the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual instruction tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. In extensive experiments, despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement, with a 34.4% gain on the MMMU-Pro (vision) benchmark and a 37.5% gain on the MME (Cog.) benchmark, alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.
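
To make the block-decoding idea in the abstract more concrete, here is a minimal toy sketch in Python. It is not the DiffusionVL implementation: `toy_denoiser`, `block_decode`, the token ids, and all parameters are hypothetical, and a random predictor stands in for a real model. It only illustrates the general pattern of generating one block at a time, unmasking the most confident positions over a few steps, and freezing finished blocks so that a real model could reuse their cached keys and values.

```python
import random

MASK, EOS = -1, 0  # hypothetical special token ids for this toy example


def toy_denoiser(frozen, block, rng):
    """Hypothetical stand-in for a diffusion model.

    Returns {position: (token, confidence)} for every masked slot in the block.
    A real model would condition on `frozen`; this toy one ignores it.
    """
    return {i: (rng.randint(0, 9), rng.random())
            for i, tok in enumerate(block) if tok == MASK}


def block_decode(prompt, block_size=4, steps_per_block=2, max_blocks=8, seed=0):
    rng = random.Random(seed)
    frozen = list(prompt)  # finished tokens; their keys/values could be cached
    per_step = -(-block_size // steps_per_block)  # ceil division
    for _ in range(max_blocks):
        block = [MASK] * block_size  # each new block starts fully masked
        while MASK in block:
            preds = toy_denoiser(frozen, block, rng)
            # Commit only the most confident predictions; the rest stay masked.
            best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:per_step]
            for i in best:
                block[i] = preds[i][0]
        if EOS in block:
            # Stop at the first end-of-sequence token: output length is not fixed.
            frozen.extend(block[:block.index(EOS) + 1])
            break
        frozen.extend(block)
    return frozen[len(prompt):]


if __name__ == "__main__":
    print(block_decode([5, 6, 7]))
```

Because each finished block is appended to `frozen` and never revisited, only the current block is recomputed at each denoising step, which is the property that makes cache reuse possible in this style of decoding.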

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Computation and Language

Makes AI write faster without losing quality.

16 Dec 2025

91%

dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning

CV and Pattern Recognition

Makes self-driving cars better at tricky situations.

4 Dec 2025

91%

Discrete Diffusion in Large Language and Multimodal Models: A Survey

Machine Learning (CS)

Makes AI talk and create much faster.

16 Jun 2025

Country of Origin

🇨🇳 China

Page Count

11 pages
