RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS
By: Cong Wang, Changfeng Gao, Yang Xiang, and more
Potential Business Impact:
Makes computer voices sound more real and emotional.
Differentiable reinforcement learning (RL) frameworks such as DiffRO offer a powerful approach to controllable text-to-speech (TTS), but they are vulnerable to reward hacking, particularly on nuanced tasks like emotion control. The policy model can exploit a vanilla Reward Model (RM) by generating acoustic artifacts that earn spurious rewards while degrading perceptual quality. To address this, we propose Robust Reward Policy Optimization (RRPO), a novel framework that employs a hybrid regularization scheme. This scheme yields a robust RM whose reward signal is more reliably aligned with human perception, compelling the policy to abandon detrimental shortcuts and instead learn the complex features of genuine emotion. Our ablation study confirms the enhanced robustness of the RM, as evidenced by its strong cross-lingual generalization. Subjective evaluation demonstrates that this robust RM effectively mitigates reward hacking, yielding significant improvements in both emotional expressiveness and naturalness over all baselines. Demo page: https://lrwinr.github.io/RRPO-CosyVoice.
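The abstract does not spell out the hybrid regularization scheme, but one common way to harden a reward model against artifact-driven shortcuts is to pair the standard supervised loss with a consistency regularizer over acoustically perturbed inputs. The sketch below is a minimal illustration of that general idea, not the paper's actual recipe; the `rm` module, the additive-noise perturbation, and the weights `noise_std` and `lam` are all assumptions introduced here.

```python
import torch
import torch.nn.functional as F

def robust_rm_loss(rm, mel, labels, noise_std=0.01, lam=1.0):
    """Hypothetical hybrid objective for training an emotion reward model (RM):
    a standard classification term plus a consistency term that asks the RM to
    produce similar predictions for a clean utterance and a perturbed copy,
    discouraging rewards that hinge on low-level acoustic artifacts.
    `rm` maps a batch of mel-spectrograms to emotion logits; all names and
    weighting choices here are illustrative, not taken from the paper.
    """
    logits_clean = rm(mel)
    ce = F.cross_entropy(logits_clean, labels)

    # Perturb the input with small additive noise as a stand-in for the kind
    # of acoustic artifacts a policy might exploit for spurious reward.
    mel_noisy = mel + noise_std * torch.randn_like(mel)
    logits_noisy = rm(mel_noisy)

    # Consistency term: the RM's prediction should not move under perturbation.
    consistency = F.kl_div(
        F.log_softmax(logits_noisy, dim=-1),
        F.softmax(logits_clean, dim=-1),
        reduction="batchmean",
    )
    return ce + lam * consistency
```

A reward signal trained this way tends to change little when only surface-level artifacts change, which is the property the abstract attributes to RRPO's robust RM.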
Similar Papers
Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale
Sound
Makes computer voices sound more natural and human.
Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback
Sound
Makes talking computers sound more natural, faster.
VRPO: Rethinking Value Modeling for Robust RL Training under Noisy Supervision
Machine Learning (CS)
Teaches AI to learn better from mistakes.