LatentVoiceGrad: Nonparallel Voice Conversion with Latent Diffusion/Flow-Matching Models
By: Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka and more
Potential Business Impact:
Changes voices to sound like someone else.
Previously, we introduced VoiceGrad, a nonparallel voice conversion (VC) technique that converts mel-spectrograms from source to target speakers using a score-based diffusion model. The idea is to train a score network to predict the gradient of the log density of mel-spectrograms from various speakers. VC is then performed by iteratively adjusting an input mel-spectrogram until it resembles that of the target speaker. However, two challenges remain: audio quality needs improvement, and conversion is slow compared to modern VC methods designed to operate at very high speeds. To address these, we introduce latent diffusion models into VoiceGrad, proposing an improved version that performs reverse diffusion in the autoencoder bottleneck. Additionally, we propose using a flow matching model as an alternative to the diffusion model to further speed up conversion without compromising conversion quality. Experimental results show enhanced speech quality and accelerated conversion compared to the original VoiceGrad.
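The two update rules described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a single Gaussian as the "target speaker" distribution so that the score and the flow-matching velocity have closed forms, and every name in it (`toy_score`, `toy_velocity`, `iterative_score_conversion`, `flow_matching_conversion`) is hypothetical. In the actual method a trained network would replace the closed-form functions, and the array being updated would hold autoencoder bottleneck features rather than a random vector.

```python
import numpy as np


def toy_score(x, target_mean, sigma=1.0):
    # Closed-form gradient of log N(x; target_mean, sigma^2 I).
    # Assumption: stands in for a trained, speaker-conditioned score network.
    return -(x - target_mean) / sigma**2


def iterative_score_conversion(x, target_mean, n_steps=50, step_size=0.1):
    # Repeatedly nudge the input along the score direction
    # (noise-free gradient ascent on the log density, for illustration only).
    x = x.copy()
    for _ in range(n_steps):
        x = x + step_size * toy_score(x, target_mean)
    return x


def toy_velocity(x, t, target_mean):
    # Velocity of the straight-line path x_t = (1 - t) * x_0 + t * x_1
    # when the endpoint x_1 is known to be target_mean.
    # Assumption: stands in for a trained flow-matching velocity network.
    return (target_mean - x) / (1.0 - t)


def flow_matching_conversion(x, target_mean, n_steps=4):
    # Euler integration of dx/dt = v(x, t) from t = 0 toward t = 1.
    # t never reaches 1 inside the loop, so the division above is safe.
    x = x.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * toy_velocity(x, t, target_mean)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.normal(size=8)   # hypothetical source features
    target_mean = np.full(8, 3.0)  # hypothetical target-speaker mean
    print(np.abs(iterative_score_conversion(source, target_mean) - target_mean).max())
    print(np.abs(flow_matching_conversion(source, target_mean) - target_mean).max())
```

In this toy setting the straight-line flow reaches the target in four Euler steps, while the score-based loop approaches it only geometrically over fifty steps. That gap is specific to the Gaussian example and says nothing quantitative about the speedups reported in the paper.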
Similar Papers
FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation
Sound
Makes voices change much faster.
ReFlow-VC: Zero-shot Voice Conversion Based on Rectified Flow and Speaker Feature Optimization
Sound
Changes voices faster and better.
Efficient and Fast Generative-Based Singing Voice Separation using a Latent Diffusion Model
Sound
Separates singing voice from music perfectly.