Controllable Singing Voice Synthesis using Phoneme-Level Energy Sequence

Published: September 8, 2025 | arXiv ID: 2509.07038v1

By: Yerin Ryu, Inseop Shin, Chanwoo Kim

Potential Business Impact:

Makes singing sound more emotional and real.

Business Areas:
Speech Recognition Data and Analytics, Software

Controllable Singing Voice Synthesis (SVS) aims to generate expressive singing voices reflecting user intent. While recent SVS systems achieve high audio quality, most rely on probabilistic modeling, limiting precise control over attributes such as dynamics. We address this by focusing on dynamic control (temporal loudness variation essential for musical expressiveness) and explicitly condition the SVS model on energy sequences extracted from ground-truth spectrograms, reducing annotation costs and improving controllability. We also propose a phoneme-level energy sequence for user-friendly control. To the best of our knowledge, this is the first attempt enabling user-driven dynamics control in SVS. Experiments show our method achieves over 50% reduction in mean absolute error of energy sequences for phoneme-level inputs compared to baseline and energy-predictor models, without compromising synthesis quality.
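To make the terms in the abstract concrete, the following is a minimal sketch of how a frame-level energy sequence, its phoneme-level counterpart, and the mean absolute error between two energy sequences might be computed. It is not the authors' implementation. The abstract does not give the energy definition, so the per-frame L2 norm of a magnitude spectrogram is assumed here, as is phoneme alignment in the form of per-phoneme frame counts. The function names `frame_energy`, `phoneme_level_energy`, and `energy_mae` are hypothetical.

```python
import numpy as np


def frame_energy(spec):
    """Per-frame energy of a magnitude spectrogram shaped (n_bins, n_frames).

    Assumption: energy is the L2 norm over frequency bins of each frame.
    """
    return np.linalg.norm(np.asarray(spec, dtype=float), axis=0)


def phoneme_level_energy(energy, durations):
    """Average frame energy inside each phoneme.

    durations: number of frames per phoneme (assumed to come from an alignment).
    Returns (per-phoneme means, means expanded back to frame rate).
    """
    energy = np.asarray(energy, dtype=float)
    durations = np.asarray(durations, dtype=int)
    if durations.sum() != energy.shape[0]:
        raise ValueError("durations must sum to the number of frames")
    bounds = np.concatenate(([0], np.cumsum(durations)))
    means = np.array([
        energy[start:end].mean() if end > start else 0.0
        for start, end in zip(bounds[:-1], bounds[1:])
    ])
    return means, np.repeat(means, durations)


def energy_mae(target, predicted):
    """Mean absolute error between two equal-length energy sequences."""
    target = np.asarray(target, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(target - predicted)))


# Toy example: 2 frequency bins, 4 frames, 2 phonemes of 2 frames each.
toy_spec = np.array([[3.0, 0.0, 6.0, 0.0],
                     [4.0, 0.0, 8.0, 0.0]])
toy_energy = frame_energy(toy_spec)
toy_means, toy_expanded = phoneme_level_energy(toy_energy, [2, 2])
toy_error = energy_mae(toy_energy, toy_expanded)
```

In this sketch a user would edit one value per phoneme (`toy_means`) rather than one per frame, which is the sense in which a phoneme-level sequence is easier to control. The MAE would then be measured between the requested energy sequence and the energy of the synthesized audio.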

Page Count
7 pages

Category
Computer Science:
Sound