Enhancing Spectrogram Realism in Singing Voice Synthesis via Explicit Bandwidth Extension Prior to Vocoder

Published: August 3, 2025 | arXiv ID: 2508.01796v1

By: Runxuan Yang, Kai Li, Guo Chen and more

Potential Business Impact:

Makes fake singing voices sound real.

This paper addresses the challenge of making vocoder-generated singing voice audio more realistic by reducing the distinguishable disparities between synthetic and real-life recordings, particularly in the high-frequency components of the spectrogram. Our proposed approach combines two innovations: an explicit linear spectrogram estimation step that uses a denoising diffusion process with a DiT-based neural network architecture optimized for time-frequency data, and a redesigned vocoder, based on Vocos, specialized in handling large linear spectrograms with an increased number of frequency bins. This integrated method can produce audio with high-fidelity spectrograms that are challenging for both human listeners and machine classifiers to differentiate from authentic recordings. Objective and subjective evaluations demonstrate that our streamlined approach maintains high audio quality while achieving this realism. This work presents a substantial advancement in overcoming the limitations of current vocoding techniques, particularly in the context of adversarial attacks on fake spectrogram detection.
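To make the phrase "linear spectrograms with increased frequency bins" concrete, the following is a minimal NumPy sketch of a linear-frequency magnitude spectrogram. It is not the paper's pipeline: the function name `linear_spectrogram`, the 44.1 kHz sample rate, the FFT size of 2048, the hop of 512, and the 12 kHz test tone are all assumptions chosen for illustration. It only shows that an FFT of size `n_fft` yields `n_fft // 2 + 1` evenly spaced bins, so high-frequency content keeps its own bins rather than being pooled as in a mel representation.

```python
import numpy as np


def linear_spectrogram(signal, n_fft=2048, hop=512):
    """Magnitude STFT with a Hann window.

    Returns an array of shape (n_fft // 2 + 1, n_frames).
    All parameter defaults are illustrative assumptions.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # rfft keeps only the non-negative frequencies: n_fft // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T


sample_rate = 44100  # assumed sample rate, not taken from the paper
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 12000 * t)  # one second of a 12 kHz test tone

spec = linear_spectrogram(tone)
freqs = np.fft.rfftfreq(2048, d=1 / sample_rate)  # centre frequency of each bin
peak_hz = freqs[spec.mean(axis=1).argmax()]  # bin with the most average energy
```

With these assumed settings the bins are spaced about 21.5 Hz apart up to the Nyquist frequency, so the 12 kHz tone lands in a single narrow bin. A larger `n_fft` gives more bins at the cost of a larger input for whatever model consumes the spectrogram.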

DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment

Sound

Makes AI sing songs with real-sounding voices.

10 Oct 2025 0

87%

Robust Training of Singing Voice Synthesis Using Prior and Posterior Uncertainty

Sound

Makes computer singing sound more natural and varied.

16 Dec 2025 0

87%

UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching

Audio and Speech Processing

Makes quiet sounds loud and clear.

1 Oct 2025 2


Page Count

7 pages
