Controllable Singing Voice Synthesis using Phoneme-Level Energy Sequence

Published: September 8, 2025 | arXiv ID: 2509.07038v1

By: Yerin Ryu, Inseop Shin, Chanwoo Kim

Potential Business Impact:

Makes singing sound more emotional and real.

Business Areas:

Speech Recognition, Data and Analytics, Software

Controllable Singing Voice Synthesis (SVS) aims to generate expressive singing voices reflecting user intent. While recent SVS systems achieve high audio quality, most rely on probabilistic modeling, limiting precise control over attributes such as dynamics. We address this by focusing on dynamic control--temporal loudness variation essential for musical expressiveness--and explicitly condition the SVS model on energy sequences extracted from ground-truth spectrograms, reducing annotation costs and improving controllability. We also propose a phoneme-level energy sequence for user-friendly control. To the best of our knowledge, this is the first attempt enabling user-driven dynamics control in SVS. Experiments show our method achieves over 50% reduction in mean absolute error of energy sequences for phoneme-level inputs compared to baseline and energy-predictor models, without compromising synthesis quality.
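
The abstract's pipeline (extract an energy sequence from a spectrogram, reduce it to phoneme level, compare energy sequences by mean absolute error) can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so everything here is an assumption made for illustration: frame energy is taken as the L2 norm of each frame's magnitudes, the phoneme-level sequence as the per-phoneme mean held constant over that phoneme's frames, and the function names and the `durations` input (frames per phoneme) are hypothetical.

```python
import math


def frame_energy(spectrogram):
    """Per-frame energy, assumed here to be the L2 norm of the frame's magnitudes.

    spectrogram: list of frames, each a list of magnitudes (one per frequency bin).
    """
    return [math.sqrt(sum(m * m for m in frame)) for frame in spectrogram]


def phoneme_level_energy(energy, durations):
    """Average frame energy inside each phoneme and hold it over that phoneme's frames.

    durations: number of frames per phoneme (hypothetical alignment input).
    """
    if sum(durations) != len(energy):
        raise ValueError("durations must sum to the number of frames")
    out = []
    start = 0
    for d in durations:
        segment = energy[start:start + d]
        mean = sum(segment) / d if d > 0 else 0.0
        out.extend([mean] * d)  # piecewise-constant sequence, one value per frame
        start += d
    return out


def energy_mae(target, synthesized):
    """Mean absolute error between two equal-length energy sequences."""
    if not target or len(target) != len(synthesized):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(t - s) for t, s in zip(target, synthesized)) / len(target)


# Toy usage: three frames with two frequency bins, two phonemes (2 frames + 1 frame).
frames = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]
energy = frame_energy(frames)
control = phoneme_level_energy(energy, [2, 1])
error = energy_mae(control, [2.0, 3.0, 10.0])
```

In this sketch a user would edit one number per phoneme rather than one per frame, which is one plausible reading of why the abstract calls phoneme-level input user-friendly.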

YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-free Melody Guidance

Sound

Makes computers sing any song with any words.

4 Dec 2025 1

89%

CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance

Sound

Makes computer singing sound more like real people.

24 Sep 2025 0

89%

Generative Multi-modal Feedback for Singing Voice Synthesis Evaluation

Sound

Helps computers judge singing better with words.

2 Dec 2025 2

Page Count

7 pages
