SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion
By: Zhao Guo, Ziqian Ning, Guobin Ma, and more
Potential Business Impact:
Changes your voice to sound like someone else instantly.
Voice Conversion (VC) aims to modify a speaker's timbre while preserving linguistic content. While recent VC models achieve strong performance, most struggle in real-time streaming scenarios due to high latency, dependence on ASR modules, or complex speaker disentanglement, which often results in timbre leakage or degraded naturalness. We present SynthVC, a streaming end-to-end VC framework that directly learns speaker timbre transformation from synthetic parallel data generated by a pre-trained zero-shot VC model. This design eliminates the need for explicit content-speaker separation or recognition modules. Built upon a neural audio codec architecture, SynthVC supports low-latency streaming inference with high output fidelity. Experimental results show that SynthVC outperforms baseline streaming VC systems in both naturalness and speaker similarity, achieving an end-to-end latency of just 77.1 ms.
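The abstract's core idea is training on synthetic parallel data: a pre-trained zero-shot VC "teacher" converts each source utterance toward a target speaker, yielding (source, converted) pairs that a streaming model can learn from directly, with no ASR module or explicit content/speaker disentanglement. The sketch below illustrates that data-generation step under stated assumptions; the names (`ZeroShotVCTeacher`-style `teacher_convert`, `ParallelPair`, `make_parallel_pairs`) and any chunking details are illustrative, not the authors' released API or exact method.

```python
# Minimal sketch, assuming a callable zero-shot VC teacher is available.
# The streaming student (e.g. a causal neural-codec encoder/decoder, as the
# abstract describes) would then be trained to map source_wav -> target_wav.
from dataclasses import dataclass
from typing import Callable, Iterable, List
import numpy as np


@dataclass
class ParallelPair:
    source_wav: np.ndarray   # original utterance (source timbre)
    target_wav: np.ndarray   # teacher output: same content, target timbre


def make_parallel_pairs(
    teacher_convert: Callable[[np.ndarray, np.ndarray], np.ndarray],
    source_utts: Iterable[np.ndarray],
    target_ref: np.ndarray,
) -> List[ParallelPair]:
    """Run the zero-shot teacher offline to build a synthetic parallel corpus."""
    pairs = []
    for wav in source_utts:
        # Teacher keeps linguistic content, swaps timbre toward target_ref.
        converted = teacher_convert(wav, target_ref)
        pairs.append(ParallelPair(source_wav=wav, target_wav=converted))
    return pairs


# Usage (hypothetical): pairs = make_parallel_pairs(teacher.convert, corpus, ref_wav)
# At inference time the trained student only sees the source audio stream,
# processed chunk by chunk, which is what enables the low reported latency.
```

The design choice this reflects: because the parallel targets are synthesized rather than recorded, the student can be a plain end-to-end mapping network, avoiding the recognition or disentanglement stages that the abstract identifies as sources of latency and timbre leakage.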
Similar Papers
O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion
Sound
Changes your voice to sound like anyone.
RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding
Audio and Speech Processing
Changes your voice to sound like someone else instantly.
MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows
Audio and Speech Processing
Changes voices to sound like anyone else.