FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation
By: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and more
Potential Business Impact:
Converts a voice to another speaker's much faster, without sacrificing quality.
A diffusion-based voice conversion (VC) model (e.g., VoiceGrad) can achieve high speech quality and speaker similarity; however, its conversion process is slow owing to iterative sampling. FastVoiceGrad overcomes this limitation by distilling VoiceGrad into a one-step diffusion model. However, it still requires a computationally intensive content encoder to disentangle the speaker's identity from the linguistic content, which slows conversion. We therefore propose FasterVoiceGrad, a novel one-step diffusion-based VC model obtained by simultaneously distilling the diffusion model and the content encoder using adversarial diffusion conversion distillation (ADCD), in which distillation is performed directly in the conversion process while leveraging adversarial and score-distillation training. Experimental evaluations of one-shot VC demonstrated that FasterVoiceGrad achieves VC performance competitive with FastVoiceGrad while running 6.6-6.9 times faster on a GPU and 1.8 times faster on a CPU.
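The core idea of the abstract — replacing an iterative, multi-step diffusion sampler with a single-step student trained against both a distillation target and an adversarial-style realism signal — can be illustrated with a deliberately tiny sketch. This is not the paper's method or code: the 1-D "teacher", the affine "student", and the realism penalty standing in for an adversarial critic are all illustrative assumptions.

```python
# Toy sketch (NOT the paper's code): distill an iterative "teacher"
# denoiser into a one-step "student" with a combined loss:
# distillation term (match the teacher's multi-step output) plus a
# small realism term (a crude stand-in for an adversarial critic).

def teacher_denoise(x, target, steps=50, rate=0.2):
    """Iteratively move a noisy sample toward the clean target
    (a 1-D stand-in for multi-step diffusion sampling)."""
    for _ in range(steps):
        x = x + rate * (target - x)
    return x

def train_student(samples, target, epochs=500, lr=0.05, adv_weight=0.1):
    """Fit a one-step affine student x -> a*x + b by gradient descent on
    L = (student - teacher)^2 + adv_weight * (student - target)^2."""
    a, b = 1.0, 0.0
    for _ in range(epochs):
        ga = gb = 0.0
        for x in samples:
            y_teacher = teacher_denoise(x, target)
            y_student = a * x + b
            r_distill = y_student - y_teacher      # match the teacher
            r_adv = y_student - target             # "look realistic"
            r = r_distill + adv_weight * r_adv     # combined residual
            # Exact gradient of the weighted squared loss above:
            ga += 2 * r * x
            gb += 2 * r
        a -= lr * ga / len(samples)
        b -= lr * gb / len(samples)
    return a, b

samples = [-2.0, -1.0, 0.5, 1.5, 3.0]
target = 0.7
a, b = train_student(samples, target)
# One student step now approximates the teacher's 50-step output.
```

The point of the sketch is the training setup, not the model: the expensive iterative sampler is called only at distillation time, while inference reduces to a single cheap pass — mirroring the GPU/CPU speedups reported in the abstract.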
Similar Papers
LatentVoiceGrad: Nonparallel Voice Conversion with Latent Diffusion/Flow-Matching Models
Sound
Changes voices to sound like someone else.
ReFlow-VC: Zero-shot Voice Conversion Based on Rectified Flow and Speaker Feature Optimization
Sound
Changes voices faster and better.
AdaptVC: High Quality Voice Conversion with Adaptive Learning
Sound
Changes your voice to sound like anyone else.