Generating Novel and Realistic Speakers for Voice Conversion
By: Meiying Melissa Chen, Zhenyu Wang, Zhiyao Duan
Potential Business Impact:
Creates brand-new voices for talking machines, without needing recordings of a real target speaker.
Voice conversion (VC) models modify timbre while preserving paralinguistic features, enabling applications like dubbing and identity protection. However, most VC systems require access to target utterances, limiting their use when target data is unavailable or when users want conversion to entirely novel, unseen voices. To address this, we introduce SpeakerVAE, a lightweight method for generating novel speakers for VC. Our approach uses a deep hierarchical variational autoencoder to model the speaker timbre space. By sampling from the trained model, we generate novel speaker representations for voice synthesis in a VC pipeline. The proposed method is a flexible plug-in module compatible with various VC models and requires no co-training or fine-tuning of the base VC system. We evaluated our approach with state-of-the-art VC models: FACodec and CosyVoice2. The results demonstrate that our method successfully generates novel, unseen speakers with quality comparable to that of the training speakers.
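To make the idea concrete, below is a minimal PyTorch sketch of a hierarchical VAE over fixed-size speaker embeddings: train it to reconstruct embeddings from the base VC model's speaker encoder, then sample from the prior to obtain embeddings for speakers never seen in training. All names (`SpeakerVAESketch`, `generate`, `elbo_loss`), the dimensions, the two-level latent depth, and the Gaussian decoder are illustrative assumptions, not the paper's actual architecture or API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerVAESketch(nn.Module):
    """Two-level hierarchical VAE over fixed-size speaker embeddings (sketch)."""

    def __init__(self, emb_dim=256, z1_dim=32, z2_dim=16, hidden=512):
        super().__init__()
        # Bottom-up inference path: x -> q(z1|x) -> q(z2|z1)
        self.enc1 = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        self.q1_mu, self.q1_lv = nn.Linear(hidden, z1_dim), nn.Linear(hidden, z1_dim)
        self.enc2 = nn.Sequential(nn.Linear(z1_dim, hidden), nn.ReLU())
        self.q2_mu, self.q2_lv = nn.Linear(hidden, z2_dim), nn.Linear(hidden, z2_dim)
        # Top-down generative path: z2 -> p(z1|z2) -> decoder -> x_hat
        self.prior1 = nn.Sequential(nn.Linear(z2_dim, hidden), nn.ReLU())
        self.p1_mu, self.p1_lv = nn.Linear(hidden, z1_dim), nn.Linear(hidden, z1_dim)
        self.dec = nn.Sequential(nn.Linear(z1_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, emb_dim))
        self.z2_dim = z2_dim

    @staticmethod
    def _sample(mu, logvar):
        # Reparameterization trick: z = mu + sigma * eps
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    def forward(self, x):
        h1 = self.enc1(x)
        q1_mu, q1_lv = self.q1_mu(h1), self.q1_lv(h1)
        z1 = self._sample(q1_mu, q1_lv)
        h2 = self.enc2(z1)
        q2_mu, q2_lv = self.q2_mu(h2), self.q2_lv(h2)
        z2 = self._sample(q2_mu, q2_lv)
        hp = self.prior1(z2)
        return self.dec(z1), (q1_mu, q1_lv), (q2_mu, q2_lv), (self.p1_mu(hp), self.p1_lv(hp))

    @torch.no_grad()
    def generate(self, n=1):
        # Ancestral sampling from the prior yields embeddings for
        # speakers that never appeared in the training data.
        z2 = torch.randn(n, self.z2_dim)
        hp = self.prior1(z2)
        z1 = self._sample(self.p1_mu(hp), self.p1_lv(hp))
        return self.dec(z1)

def kl_gauss(mu_q, lv_q, mu_p, lv_p):
    # KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)), summed over latent dims
    return 0.5 * (lv_p - lv_q + (lv_q.exp() + (mu_q - mu_p) ** 2) / lv_p.exp() - 1).sum(-1)

def elbo_loss(model, x):
    # Negative ELBO: reconstruction + KL at both latent levels,
    # with the level-1 posterior regularized toward the learned prior p(z1|z2).
    x_hat, (q1_mu, q1_lv), (q2_mu, q2_lv), (p1_mu, p1_lv) = model(x)
    recon = F.mse_loss(x_hat, x, reduction="none").sum(-1)
    kl2 = kl_gauss(q2_mu, q2_lv, torch.zeros_like(q2_mu), torch.zeros_like(q2_lv))
    kl1 = kl_gauss(q1_mu, q1_lv, p1_mu, p1_lv)
    return (recon + kl1 + kl2).mean()

# Usage: fit on speaker embeddings extracted by the base VC model, then
# feed generated embeddings into its synthesis path unchanged -- the
# plug-in nature means the VC system itself is never retrained.
vae = SpeakerVAESketch()
novel_speakers = vae.generate(n=4)  # (4, 256) unseen-speaker embeddings
```

Because the VAE only models the speaker representation space, the same trained sampler can in principle serve any VC backbone that consumes a fixed-size speaker embedding, which matches the paper's claim of compatibility with models like FACodec and CosyVoice2.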
Similar Papers
Voice Conversion with Diverse Intonation using Conditional Variational Auto-Encoder
Sound
Changes voices to sound like anyone, with feeling.
SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion
Sound
Changes your voice to sound like someone else instantly.
O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion
Sound
Changes your voice to sound like anyone.