WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
By: Jian Yang, Dacheng Yin, Xiaoxuan He, and more
Potential Business Impact:
Teaches AI to learn many new things without forgetting.
Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and the Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with a linear projection to recover fine-grained image details. Experimental results confirm that our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.
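The abstract names two ingredients: query tokens perturbed with noise so the VLM-to-diffusion interface learns a distributed representation space rather than a fixed point per token, and a VAE branch with a linear projection that carries fine-grained image detail. Below is a minimal PyTorch sketch of that idea; the module names, dimensions, noise schedule, and the `vlm` callable are all assumptions for illustration, not the paper's actual implementation.

import torch
import torch.nn as nn

class NoisyQueryBridge(nn.Module):
    """Hypothetical sketch: learnable query tokens, perturbed with Gaussian
    noise during training, condition a diffusion model on VLM outputs; a
    VAE branch with a linear projection adds fine-grained detail."""

    def __init__(self, num_queries=64, vlm_dim=4096, cond_dim=1024, vae_dim=512):
        super().__init__()
        # Fixed number of learnable query tokens (the efficient baseline).
        self.queries = nn.Parameter(torch.randn(num_queries, vlm_dim) * 0.02)
        # Projects VLM hidden states into the diffusion conditioning space.
        self.to_cond = nn.Linear(vlm_dim, cond_dim)
        # VAE branch: linear projection of VAE latents for fine detail.
        self.vae_proj = nn.Linear(vae_dim, cond_dim)

    def forward(self, vlm, vlm_inputs, vae_latents, noise_std=0.1):
        batch = vae_latents.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        if self.training:
            # Noisy Query Tokens: inject noise so nearby points in query
            # space decode to similar conditions (a distributed space).
            q = q + noise_std * torch.randn_like(q)
        # `vlm` is any callable that attends over its inputs plus the
        # (noisy) query tokens and returns per-query hidden states.
        hidden = vlm(vlm_inputs, query_tokens=q)   # (B, num_queries, vlm_dim)
        cond = self.to_cond(hidden)                # (B, num_queries, cond_dim)
        detail = self.vae_proj(vae_latents)        # (B, T, cond_dim)
        # Concatenated sequence serves as conditioning for the diffusion model.
        return torch.cat([cond, detail], dim=1)

The noise injection is the key design choice: because the diffusion model must produce consistent outputs from perturbed queries, the bridge is pushed toward a smooth, distributed representation space, which is what lets later tasks be learned without collapsing the earlier ones.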
Similar Papers
A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
CV and Pattern Recognition
Makes AI understand pictures and words much faster.
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
CV and Pattern Recognition
Lets computers see smarter, using less data.
EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens
CV and Pattern Recognition
Makes AI understand pictures better without using more power.