Score: 0

CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance

Published: September 24, 2025 | arXiv ID: 2509.19883v1

By: Junchuan Zhao , Wei Zeng , Tianle Lyu and more

Potential Business Impact:

Makes computer singing sound more like real people.

Business Areas:

Speech Recognition Data and Analytics, Software

Singing Voice Synthesis (SVS) aims to generate expressive vocal performances from structured musical inputs such as lyrics and pitch sequences. While recent progress in discrete codec-based speech synthesis has enabled zero-shot generation via in-context learning, directly extending these techniques to SVS remains non-trivial due to the requirement for precise melody control. In particular, prompt-based generation often introduces prosody leakage, where pitch information is inadvertently entangled within the timbre prompt, compromising controllability. We present CoMelSinger, a zero-shot SVS framework that enables structured and disentangled melody control within a discrete codec modeling paradigm. Built on the non-autoregressive MaskGCT architecture, CoMelSinger replaces conventional text inputs with lyric and pitch tokens, preserving in-context generalization while enhancing melody conditioning. To suppress prosody leakage, we propose a coarse-to-fine contrastive learning strategy that explicitly regularizes pitch redundancy between the acoustic prompt and melody input. Furthermore, we incorporate a lightweight encoder-only Singing Voice Transcription (SVT) module to align acoustic tokens with pitch and duration, offering fine-grained frame-level supervision. Experimental results demonstrate that CoMelSinger achieves notable improvements in pitch accuracy, timbre consistency, and zero-shot transferability over competitive baselines.

YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-free Melody Guidance

Sound

Makes computers sing any song with any words.

4 Dec 2025 1

90%

YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases

Sound

Makes one person's singing sound like another.

4 Dec 2025 1

89%

Controllable Singing Voice Synthesis using Phoneme-Level Energy Sequence

Sound

Makes singing sound more emotional and real.

8 Sep 2025 0

View PDF Login to Bookmark

Country of Origin

🇸🇬 Singapore

Page Count

13 pages

CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance

Makes computer singing sound more like real people.

Technical Abstract

YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-free Melody Guidance

YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases

Controllable Singing Voice Synthesis using Phoneme-Level Energy Sequence