Lightweight and perceptually-guided voice conversion for electro-laryngeal speech
By: Benedikt Mayrhofer, Franz Pernkopf, Philipp Aichinger and more
Potential Business Impact:
Makes robotic voices sound more human.
Electro-laryngeal (EL) speech is characterized by constant pitch, limited prosody, and mechanical noise, reducing naturalness and intelligibility. We propose a lightweight adaptation of the state-of-the-art StreamVC framework to this setting by removing pitch and energy modules and combining self-supervised pretraining with supervised fine-tuning on parallel EL and healthy (HE) speech data, guided by perceptual and intelligibility losses. Objective and subjective evaluations across different loss configurations confirm the influence of these losses: the best model variant, based on WavLM features and human-feedback predictions (+WavLM+HF), drastically reduces the character error rate (CER) of EL inputs, raises the naturalness mean opinion score (nMOS) from 1.1 to 3.3, and consistently narrows the gap to HE ground-truth speech in all evaluated metrics. These findings demonstrate the feasibility of adapting lightweight voice conversion architectures to EL voice rehabilitation while also identifying prosody generation and intelligibility improvements as the main remaining bottlenecks.
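The character error rate cited in the abstract is conventionally defined as the character-level edit distance between a reference transcript and a recognizer's hypothesis, divided by the reference length. The abstract does not say which speech recognizer or text normalization the authors used, so the following is only a minimal sketch of the standard definition. The function names `edit_distance` and `cer` are illustrative and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, used only to illustrate the computation.
    print(cer("kitten", "sitting"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference. In an evaluation like the one described, the hypothesis would come from running a speech recognizer on the converted audio.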
Similar Papers
Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition
Computation and Language
Helps computers understand noisy speech better.
ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction
Sound
Helps computers hear the right voice in noisy rooms.
Decoding EEG Speech Perception with Transformers and VAE-based Data Augmentation
Audio and Speech Processing
Lets computers understand thoughts as words.