PseudoVC: Improving One-shot Voice Conversion with Pseudo Paired Data
By: Songjun Cao , Qinghua Wu , Jie Chen and more
Potential Business Impact:
Changes one person's voice to sound like another.
As parallel training data is scarce for one-shot voice conversion (VC) tasks, waveform reconstruction is typically performed by various VC systems. A typical one-shot VC system comprises a content encoder and a speaker encoder. However, two types of mismatches arise: one for the inputs to the content encoder during training and inference, and another for the inputs to the speaker encoder. To address these mismatches, we propose a novel VC training method called \textit{PseudoVC} in this paper. First, we introduce an innovative information perturbation approach named \textit{Pseudo Conversion} to tackle the first mismatch problem. This approach leverages pretrained VC models to convert the source utterance into a perturbed utterance, which is fed into the content encoder during training. Second, we propose an approach termed \textit{Speaker Sampling} to resolve the second mismatch problem, which will substitute the input to the speaker encoder by another utterance from the same speaker during training. Experimental results demonstrate that our proposed \textit{Pseudo Conversion} outperforms previous information perturbation methods, and the overall \textit{PseudoVC} method surpasses publicly available VC models. Audio examples are available.
Similar Papers
O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion
Sound
Changes your voice to sound like anyone.
Pureformer-VC: Non-parallel Voice Conversion with Pure Stylized Transformer Blocks and Triplet Discriminative Training
Sound
Changes voices to sound like anyone else.
Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching
Audio and Speech Processing
Changes your voice to sound like anyone else.