DiffDSR: Dysarthric Speech Reconstruction Using Latent Diffusion Model
By: Xueyuan Chen, Dongchao Yang, Wenxuan Wu and more
Potential Business Impact:
Makes speech understandable while keeping the voice.
Dysarthric speech reconstruction (DSR) aims to convert dysarthric speech into comprehensible speech while maintaining the speaker's identity. Despite significant advancements, existing methods often struggle with low speech intelligibility and poor speaker similarity. In this study, we introduce a novel diffusion-based DSR system that leverages a latent diffusion model to enhance the quality of speech reconstruction. Our model comprises: (i) a speech content encoder that restores phoneme embeddings using pre-trained self-supervised learning (SSL) speech foundation models; (ii) a speaker identity encoder that preserves speaker identity through an in-context learning mechanism; and (iii) a diffusion-based speech generator that reconstructs the speech from the restored phoneme embeddings and the preserved speaker identity. In evaluations on the widely used UASpeech corpus, our proposed model shows notable improvements in speech intelligibility and speaker similarity.
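The three-component data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function name, the vector sizes, the noise schedule, and the untrained placeholder denoiser are assumptions made for illustration. It only shows how a content embedding and a speaker vector could condition a standard DDPM-style reverse sampling loop that produces a latent; the pre-trained SSL model, the in-context learning mechanism, and the latent-to-waveform decoder are all omitted.

```python
import math
import random


def content_encoder(dysarthric_frames):
    """Hypothetical stand-in for the speech content encoder.

    The real system restores phoneme embeddings with a pre-trained SSL model;
    here each frame is simply squashed with tanh.
    """
    return [[math.tanh(x) for x in frame] for frame in dysarthric_frames]


def speaker_encoder(prompt_frames):
    """Hypothetical stand-in: average the prompt frames into one speaker vector."""
    dim = len(prompt_frames[0])
    n = len(prompt_frames)
    return [sum(frame[d] for frame in prompt_frames) / n for d in range(dim)]


def toy_denoiser(x_t, t, content, speaker):
    """Untrained placeholder noise predictor (assumption, not a real network).

    It treats (content + speaker) as the clean target and returns the
    difference between the current sample and that target.
    """
    return [
        [x - (c + s) for x, c, s in zip(x_frame, c_frame, speaker)]
        for x_frame, c_frame in zip(x_t, content)
    ]


def reconstruct(dysarthric_frames, prompt_frames, steps=20, seed=0):
    """Run a DDPM-style reverse loop conditioned on content and speaker."""
    rng = random.Random(seed)
    content = content_encoder(dysarthric_frames)
    speaker = speaker_encoder(prompt_frames)

    # Linear beta schedule (assumed values) and cumulative products of alphas.
    betas = [1e-4 + (0.02 - 1e-4) * i / (steps - 1) for i in range(steps)]
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)

    # Start from Gaussian noise with one latent frame per content frame.
    x = [[rng.gauss(0.0, 1.0) for _ in speaker] for _ in content]

    for t in reversed(range(steps)):
        eps = toy_denoiser(x, t, content, speaker)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on the last step
        x = [
            [scale * (xi - coef * ei) + sigma * rng.gauss(0.0, 1.0)
             for xi, ei in zip(x_frame, e_frame)]
            for x_frame, e_frame in zip(x, eps)
        ]
    return x  # a latent; a separate decoder would be needed to obtain audio


if __name__ == "__main__":
    frames = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
    prompt = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
    latent = reconstruct(frames, prompt)
    print(len(latent), len(latent[0]))
```

Because the denoiser is an untrained placeholder, the output has no acoustic meaning; the sketch is useful only for seeing where the two conditioning signals enter the sampling loop.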
Similar Papers
DiTSE: High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers
Audio and Speech Processing
Cleans up noisy and echoey voices perfectly.
Zero- and One-Shot Data Augmentation for Sentence-Level Dysarthric Speech Recognition in Constrained Scenarios
Sound
Helps computers understand speech with unclear pronunciation.
LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning
Machine Learning (CS)
Helps computers think and fix their own mistakes.