WaveLLDM: Design and Development of a Lightweight Latent Diffusion Model for Speech Enhancement and Restoration
By: Kevin Putra Santoso, Rizka Wakhidatus Sholikah, Raden Venantius Hari Ginardi
Potential Business Impact:
Cleans up noisy audio and fills in missing parts, even long ones.
High-quality audio is essential in a wide range of applications, including online communication, virtual assistants, and the multimedia industry. However, degradation caused by noise, compression, and transmission artifacts remains a major challenge. While diffusion models have proven effective for audio restoration, they typically require significant computational resources and struggle to handle longer missing segments. This study introduces WaveLLDM (Wave Lightweight Latent Diffusion Model), an architecture that integrates an efficient neural audio codec with latent diffusion for audio restoration and denoising. Unlike conventional approaches that operate in the time or spectral domain, WaveLLDM processes audio in a compressed latent space, reducing computational complexity while preserving reconstruction quality. Empirical evaluations on the Voicebank+DEMAND test set demonstrate that WaveLLDM achieves accurate spectral reconstruction, with low Log-Spectral Distance (LSD) scores (0.48 to 0.60), and good adaptability to unseen data. However, it still underperforms state-of-the-art methods in perceptual quality and speech clarity, with WB-PESQ scores ranging from 1.62 to 1.71 and STOI scores between 0.76 and 0.78. These limitations are attributed to suboptimal architectural tuning, the absence of fine-tuning, and insufficient training duration. Nevertheless, the flexible architecture, which combines a neural audio codec with a latent diffusion model, provides a strong foundation for future development.
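For readers unfamiliar with the Log-Spectral Distance (LSD) metric cited in the abstract, the following is a minimal sketch of one common formulation: the root-mean-square difference between log power spectra, averaged over frames. The abstract does not state the exact definition or STFT settings used in the paper, so the function names, `n_fft`, `hop`, and `eps` here are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def stft_power(x, n_fft=512, hop=128):
    """Power spectrogram from a Hann-windowed framed FFT, shape (frames, bins).

    n_fft and hop are assumed values, not taken from the paper.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def log_spectral_distance(reference, estimate, eps=1e-10):
    """One common LSD variant: per-frame RMS of log10 power differences,
    then the mean over frames. Lower means closer spectra."""
    ref = np.log10(stft_power(reference) + eps)  # eps avoids log(0)
    est = np.log10(stft_power(estimate) + eps)
    per_frame = np.sqrt(np.mean((ref - est) ** 2, axis=1))
    return float(np.mean(per_frame))

# Usage: compare a clean tone against a noisy copy of it.
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * np.random.default_rng(0).standard_normal(sr)
lsd_identical = log_spectral_distance(clean, clean)
lsd_noisy = log_spectral_distance(clean, noisy)
```

Under this formulation, identical signals give a distance of zero and the value grows as the spectra diverge. Scores computed this way are only comparable to the reported range if the paper's analysis settings match, which the abstract does not confirm.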
Similar Papers
Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing
Audio and Speech Processing
Makes computers understand spoken words better.
Latent Wavelet Diffusion For Ultra-High-Resolution Image Synthesis
CV and Pattern Recognition
Makes pictures look super clear and detailed.
Low-Complexity MIMO Channel Estimation with Latent Diffusion Models
Information Theory
Improves wireless signals for faster internet.