PseudoVC: Improving One-shot Voice Conversion with Pseudo Paired Data

Published: June 1, 2025 | arXiv ID: 2506.01039v1

By: Songjun Cao, Qinghua Wu, Jie Chen and more

Potential Business Impact:

Changes one person's voice to sound like another.

Business Areas:
Speech Recognition Data and Analytics, Software

Because parallel training data is scarce for one-shot voice conversion (VC), VC systems are typically trained by waveform reconstruction. A typical one-shot VC system comprises a content encoder and a speaker encoder. This setup creates two mismatches between training and inference: one in the inputs to the content encoder, and another in the inputs to the speaker encoder. To address these mismatches, we propose a novel VC training method called PseudoVC. First, we introduce an information perturbation approach named Pseudo Conversion to tackle the first mismatch. It uses pretrained VC models to convert the source utterance into a perturbed utterance, which is fed into the content encoder during training. Second, we propose an approach termed Speaker Sampling to resolve the second mismatch: during training, the input to the speaker encoder is replaced with another utterance from the same speaker. Experimental results demonstrate that Pseudo Conversion outperforms previous information perturbation methods, and that the overall PseudoVC method surpasses publicly available VC models. Audio examples are available.
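
The way the two techniques shape a single training example can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `make_training_example`, the `pseudo_convert` callable (standing in for a pretrained VC model), the choice of a random utterance from a different speaker as the conversion reference, and the use of the original source utterance as the reconstruction target are all assumptions made for the example. Utterances are represented as plain strings in place of waveforms.

```python
import random
from typing import Callable, Dict, List, Tuple

Utterance = str  # stand-in for a waveform; a real system would use audio arrays


def make_training_example(
    corpus: Dict[str, List[Utterance]],
    speaker: str,
    index: int,
    pseudo_convert: Callable[[Utterance, Utterance], Utterance],
    rng: random.Random,
) -> Tuple[Utterance, Utterance, Utterance]:
    """Return (content_input, speaker_input, target) for one training step.

    `pseudo_convert(source, reference)` is a hypothetical wrapper around a
    pretrained VC model; it is not an API from the paper.
    """
    source = corpus[speaker][index]

    # Pseudo Conversion: perturb the source utterance with a pretrained VC
    # model. Assumption: the reference comes from a random other speaker.
    other_speakers = [s for s in corpus if s != speaker]
    if not other_speakers:
        raise ValueError("need at least two speakers for pseudo conversion")
    reference = rng.choice(corpus[rng.choice(other_speakers)])
    content_input = pseudo_convert(source, reference)

    # Speaker Sampling: feed the speaker encoder a different utterance from
    # the same speaker, falling back to the source if there is only one.
    candidates = [u for i, u in enumerate(corpus[speaker]) if i != index]
    speaker_input = rng.choice(candidates) if candidates else source

    # Assumption: the model is trained to reconstruct the original source.
    return content_input, speaker_input, source
```

In this sketch the content encoder never sees the clean source utterance and the speaker encoder never sees the exact utterance being reconstructed, which is the intent behind the two mismatch fixes described in the abstract.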

Page Count
5 pages

Category
Electrical Engineering and Systems Science:
Audio and Speech Processing