EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS

Published: October 7, 2025 | arXiv ID: 2510.05758v1

By: Haoxun Li, Yu Liu, Yuqing Sun, and more

Potential Business Impact:

Enables synthesized AI voices to convey emotions and word-level emphasis more accurately.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Recent LLM-based TTS systems achieve strong quality and zero-shot capability, but lack fine-grained emotional control due to their reliance on discrete speech tokens. Existing approaches either limit emotions to categorical labels or cannot generalize to LLM-based architectures. We propose EMORL-TTS (Fine-grained Emotion-controllable TTS with Reinforcement Learning), a framework that unifies global intensity control in the VAD space with local emphasis regulation. Our method combines supervised fine-tuning with reinforcement learning guided by task-specific rewards for emotion category, intensity, and emphasis. We further investigate how emphasis placement modulates fine-grained emotion intensity. Experiments show that EMORL-TTS improves emotion accuracy, intensity differentiation, and emphasis clarity, while preserving synthesis quality comparable to strong LLM-based baselines.
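
To make the reward structure described above concrete, here is a minimal sketch (not the authors' code) of one plausible way to combine the three task-specific rewards the abstract mentions: emotion category, intensity in the valence-arousal-dominance (VAD) space, and emphasis clarity, into a single scalar for RL fine-tuning. All function names, weights, and inputs below are illustrative assumptions, not details from the paper.

```python
# Hypothetical composite reward for RL fine-tuning of an emotion-controllable TTS model.
# Inputs (classifier probabilities, predicted VAD, per-word prominence) are assumed to
# come from external evaluation models; none of these interfaces are from the paper.
import numpy as np


def category_reward(pred_probs: np.ndarray, target_idx: int) -> float:
    """Probability an (assumed) speech-emotion classifier assigns to the target emotion."""
    return float(pred_probs[target_idx])


def intensity_reward(pred_vad: np.ndarray, target_vad: np.ndarray) -> float:
    """Closeness of predicted to target intensity in VAD space (1.0 = exact match)."""
    # L2 distance mapped to (0, 1]; VAD coordinates assumed normalized to [0, 1].
    return float(np.exp(-np.linalg.norm(pred_vad - target_vad)))


def emphasis_reward(word_prominence: np.ndarray, emphasized_idx: int) -> float:
    """How strongly the emphasized word stands out among per-word prominence scores."""
    probs = np.exp(word_prominence) / np.exp(word_prominence).sum()  # softmax
    return float(probs[emphasized_idx])


def total_reward(pred_probs, target_idx, pred_vad, target_vad,
                 word_prominence, emphasized_idx,
                 w_cat=1.0, w_int=1.0, w_emp=1.0) -> float:
    """Weighted sum of the three task-specific rewards (weights are assumptions)."""
    return (w_cat * category_reward(pred_probs, target_idx)
            + w_int * intensity_reward(pred_vad, target_vad)
            + w_emp * emphasis_reward(word_prominence, emphasized_idx))


if __name__ == "__main__":
    # Toy example: 4 emotion classes with target class 2, a target VAD of (0.8, 0.7, 0.6),
    # and emphasis expected on word index 3 of a 6-word utterance.
    r = total_reward(
        pred_probs=np.array([0.05, 0.10, 0.80, 0.05]), target_idx=2,
        pred_vad=np.array([0.75, 0.65, 0.55]), target_vad=np.array([0.8, 0.7, 0.6]),
        word_prominence=np.array([0.1, 0.2, 0.1, 1.5, 0.2, 0.1]), emphasized_idx=3,
    )
    print(f"composite reward: {r:.3f}")
```

The weighted-sum form is only one option; the paper does not specify how the individual rewards are aggregated or how the evaluation signals are produced.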

Page Count
5 pages

Category
Computer Science: Sound