DIFFA: Large Language Diffusion Models Can Listen and Understand
By: Jiaming Zhou, Hongjie Chen, Shiwan Zhao, and more
Potential Business Impact:
Lets computers understand spoken words better.
Plain English Summary
Imagine talking to your computer or smart speaker, and it actually understands what you mean, not just the words you say. This new AI can do that by pairing a speech listener with a text model that refines its whole answer at once, instead of guessing one word at a time. This means future voice assistants could be much better at understanding complex requests and helping you out.
Recent advances in large language models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, diffusion-based language models have emerged as a promising alternative to the autoregressive paradigm, offering improved controllability, bidirectional context modeling, and robust generation. However, their application to the audio modality remains underexplored. In this work, we introduce DIFFA, the first diffusion-based Large Audio-Language Model designed to perform spoken language understanding. DIFFA integrates a frozen diffusion language model with a lightweight dual-adapter architecture that bridges speech understanding and natural language reasoning. We employ a two-stage training pipeline: first, aligning semantic representations via an ASR objective; then, learning instruction-following abilities through synthetic audio-caption pairs automatically generated by prompting LLMs. Despite being trained on only 960 hours of ASR data and 127 hours of synthetic instruction data, DIFFA demonstrates competitive performance on major benchmarks, including MMSU, MMAU, and VoiceBench, outperforming several autoregressive open-source baselines. Our results reveal the potential of diffusion-based language models for efficient and scalable audio understanding, opening a new direction for speech-driven AI. Our code will be available at https://github.com/NKU-HLT/DIFFA.git.
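The abstract describes the architecture only at a high level: a frozen speech encoder and a frozen diffusion language model bridged by a small trainable adapter. The sketch below is a minimal PyTorch illustration of that idea, not the authors' implementation; the class names (DualAdapter, DiffaStyleModel), the two-branch adapter layout, and all dimensions are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class DualAdapter(nn.Module):
    """Hypothetical dual adapter: two parallel MLP branches projecting
    speech-encoder features into the language model's embedding space.
    The split into two branches is an assumption based on the paper's
    'dual-adapter' description."""
    def __init__(self, audio_dim: int, lm_dim: int, hidden: int = 1024):
        super().__init__()
        self.branch_a = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.GELU(), nn.Linear(hidden, lm_dim)
        )
        self.branch_b = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.GELU(), nn.Linear(hidden, lm_dim)
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim) from the speech encoder
        return self.branch_a(audio_feats) + self.branch_b(audio_feats)

class DiffaStyleModel(nn.Module):
    """Frozen speech encoder + frozen diffusion LM; only the lightweight
    adapter is trainable, mirroring the setup the abstract describes."""
    def __init__(self, speech_encoder: nn.Module, diffusion_lm: nn.Module,
                 audio_dim: int, lm_dim: int):
        super().__init__()
        self.speech_encoder = speech_encoder
        self.diffusion_lm = diffusion_lm
        self.adapter = DualAdapter(audio_dim, lm_dim)
        for p in self.speech_encoder.parameters():
            p.requires_grad = False
        for p in self.diffusion_lm.parameters():
            p.requires_grad = False

    def forward(self, waveform: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            audio_feats = self.speech_encoder(waveform)
        audio_embeds = self.adapter(audio_feats)
        # Prepend projected audio embeddings to the (partially masked)
        # text embeddings and let the diffusion LM denoise the text.
        return self.diffusion_lm(torch.cat([audio_embeds, text_embeds], dim=1))
```

Under the paper's two-stage recipe, the same adapter parameters would presumably be the only weights updated in both stages: first against an ASR objective, then on the synthetic audio-caption instruction data.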
Similar Papers
Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing
Audio and Speech Processing
Makes computers understand spoken words better.
Large Language Diffusion Models
Computation and Language
New AI that generates text by gradually refining a whole passage, instead of predicting one word at a time.