ReFlow-VC: Zero-shot Voice Conversion Based on Rectified Flow and Speaker Feature Optimization

Published: June 1, 2025 | arXiv ID: 2506.01032v1

By: Pengyu Ren, Wenhao Guan, Kaidi Wang and more

Potential Business Impact:

Changes voices faster and better.

Business Areas:
Speech Recognition, Data and Analytics, Software

In recent years, diffusion-based generative models such as Denoising Diffusion Probabilistic Models (DDPM) have demonstrated remarkable performance in speech conversion. However, this performance comes at the cost of a large number of sampling steps, which hinders practical use in real-world scenarios. In this paper, we introduce ReFlow-VC, a novel high-fidelity speech conversion method based on rectified flow. Specifically, ReFlow-VC is an Ordinary Differential Equation (ODE) model that transforms a Gaussian distribution into the true Mel-spectrogram distribution along the most direct path. Furthermore, we propose a modeling approach that optimizes speaker features using both content and pitch information, so that the speaker features reflect the properties of the current speech more accurately. Experimental results show that ReFlow-VC performs exceptionally well on small datasets and in zero-shot scenarios.
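
To make the "most direct path" idea concrete, here is a minimal sketch of the generic rectified-flow recipe: noise and data are joined by a straight line, the regression target is the constant velocity of that line, and sampling integrates the ODE with a few Euler steps. This is not the ReFlow-VC implementation. The function names, the 80x10 array shape, and the oracle velocity are assumptions made for illustration; in the paper the velocity would come from a trained network conditioned on content, pitch, and speaker features, which this summary does not describe.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Constant velocity of the straight path; the usual regression target."""
    return x1 - x0

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy stand-ins (hypothetical): random arrays shaped like an
# 80-bin, 10-frame Mel-spectrogram, not real speech features.
rng = np.random.default_rng(0)
noise = rng.standard_normal((80, 10))
mel = rng.standard_normal((80, 10))

# Oracle velocity for this single pair, used in place of a trained model.
oracle = lambda x, t: target_velocity(noise, mel)

one_step = euler_sample(oracle, noise, num_steps=1)
print(np.allclose(one_step, mel))
```

Because the velocity along a perfectly straight path is constant, a single Euler step already reaches the target in this toy case. A learned velocity field is only approximately straight, so in practice a small number of steps trades speed against accuracy.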

Country of Origin
🇨🇳 China

Page Count
5 pages

Category
Computer Science:
Sound