Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback
By: Jingyi Chen, Ju Seung Byun, Micha Elsner, and more
Potential Business Impact:
Makes talking computers sound more natural, faster.
Diffusion models produce high-fidelity speech but are inefficient for real-time use because generation requires many denoising steps, and they struggle to model intonation and rhythm. To address this, we propose Diffusion Loss-Guided Policy Optimization (DLPO), an RLHF framework for TTS diffusion models. DLPO incorporates the original training loss into the reward function, preserving generative capabilities while reducing inefficiencies. Using naturalness scores as feedback, DLPO aligns reward optimization with the diffusion model's structure, improving speech quality. We evaluate DLPO on WaveGrad 2, a non-autoregressive diffusion-based TTS model. Results show significant improvements in objective metrics (UTMOS 3.65, NISQA 4.02) and subjective evaluations, with DLPO audio preferred 67% of the time. These findings demonstrate DLPO's potential for efficient, high-quality diffusion TTS in real-time, resource-limited settings.
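To make the core idea concrete, below is a minimal sketch of a DLPO-style update under simplifying assumptions: `Denoiser` is a toy stand-in for the TTS diffusion model (not WaveGrad 2), `reward_model` stands in for a naturalness scorer (e.g., a UTMOS-like predictor), `beta` weights the original diffusion loss added to the RL objective, and the forward-noising step is deliberately simplified. All names, shapes, and the single-step training loop are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class Denoiser(nn.Module):
    """Toy stand-in for a diffusion TTS denoiser: predicts noise from (x_t, t)."""

    def __init__(self, dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, x_t, t):
        t_feat = t.expand(x_t.size(0), 1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))


def dlpo_step(denoiser, reward_model, x0, beta=0.1, sigma=0.1):
    """One DLPO-style update: reward-weighted policy-gradient term plus the
    original diffusion (noise-prediction) loss kept as a regularizer."""
    t = torch.rand(1)                       # random diffusion timestep in [0, 1)
    noise = torch.randn_like(x0)
    x_t = x0 + sigma * noise                # simplified forward-noising step (assumption)

    pred_noise = denoiser(x_t, t)

    # Original diffusion training loss, retained in the objective so RL
    # fine-tuning does not erode the model's generative capability.
    diffusion_loss = ((pred_noise - noise) ** 2).mean()

    # Treat the denoiser output as the mean of a Gaussian "policy" over noise;
    # score the sampled noise under that policy.
    dist = torch.distributions.Normal(pred_noise, sigma)
    log_prob = dist.log_prob(noise).sum(dim=-1)

    # Naturalness reward on a rough one-step reconstruction (no gradient).
    with torch.no_grad():
        reward = reward_model(x_t - sigma * pred_noise).squeeze(-1)

    # REINFORCE-style term: increase log-prob of actions that earned high reward.
    pg_loss = -(reward * log_prob).mean()

    return pg_loss + beta * diffusion_loss


if __name__ == "__main__":
    torch.manual_seed(0)
    denoiser = Denoiser()
    reward_model = nn.Sequential(nn.Linear(80, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

    x0 = torch.randn(8, 80)                 # fake mel-spectrogram batch
    loss = dlpo_step(denoiser, reward_model, x0)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"DLPO sketch loss: {loss.item():.4f}")
```

The design point the sketch tries to convey is the combined objective: the reward-weighted log-probability term steers generation toward higher naturalness scores, while the added diffusion loss (scaled by `beta`) anchors the model to its original denoising behavior.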
Similar Papers
RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS
Sound
Makes computer voices sound more real and emotional.
Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization
Machine Learning (CS)
Teaches AI to solve math and code problems better.
Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps
Machine Learning (CS)
Makes robots learn faster and better from mistakes.