Discrete-time diffusion-like models for speech synthesis
By: Xiaozhou Tan, Minghui Zhao, Mattias Cross, and more
Potential Business Impact:
Makes computers create speech more efficiently.
Diffusion models have attracted considerable attention in recent years. These models view speech generation as a continuous-time process. For efficient training, this process is typically restricted to additive Gaussian noising, which is limiting. For inference, time is typically discretized, leading to a mismatch between continuous training and discrete sampling conditions. Recently proposed discrete-time processes, on the other hand, usually do not have these limitations, may require substantially fewer inference steps, and are fully consistent between training and inference conditions. This paper explores several diffusion-like discrete-time processes and proposes new variants, including processes that apply additive Gaussian noise, multiplicative Gaussian noise, blurring noise, and a mixture of blurring and Gaussian noises. The experimental results suggest that discrete-time processes offer subjective and objective speech quality comparable to their widely popular continuous counterparts, with more efficient and consistent training and inference schemes.
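To make the four corruption types concrete, the NumPy sketch below applies a small number of discrete forward (noising) steps to a mel-spectrogram-shaped array. This is a minimal illustration of the general idea, not the authors' formulation: the step definitions, the schedules (betas, sigmas, widths), and the step count T are all assumptions chosen for readability.

```python
# Minimal sketch of discrete-time forward (noising) processes of the kinds the
# paper discusses. Schedules and parameterisations here are illustrative
# assumptions, not the paper's actual method.
import numpy as np

def additive_gaussian_step(x, t, betas, rng):
    # DDPM-style discrete step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps
    eps = rng.standard_normal(x.shape)
    return np.sqrt(1.0 - betas[t]) * x + np.sqrt(betas[t]) * eps

def multiplicative_gaussian_step(x, t, sigmas, rng):
    # Multiplicative noise: each element is scaled by an independent Gaussian factor
    return (1.0 + sigmas[t] * rng.standard_normal(x.shape)) * x

def blurring_step(x, t, widths):
    # Blurring "noise": convolve each row (time axis) with a Gaussian kernel
    # whose width grows with t -- a crude stand-in for heat-dissipation blurring
    w = widths[t]
    half = int(np.ceil(3 * w))
    taps = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (taps / w) ** 2)
    kernel /= kernel.sum()
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), -1, x)

def blur_plus_gaussian_step(x, t, widths, betas, rng):
    # Mixture of blurring and Gaussian noises: blur first, then add noise
    return additive_gaussian_step(blurring_step(x, t, widths), t, betas, rng)

T = 8  # a few discrete steps, rather than a finely discretized continuous process
rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.05, T)
sigmas = np.linspace(0.01, 0.2, T)
widths = np.linspace(0.5, 4.0, T)

x = rng.standard_normal((80, 200))  # placeholder for an 80-bin mel spectrogram
for t in range(T):
    x = blur_plus_gaussian_step(x, t, widths, betas, rng)  # swap in other steps to compare
```

Because each process above is defined directly on discrete steps, training and sampling use exactly the same transitions, which is the training/inference consistency the abstract highlights.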
Similar Papers
Diffusion Buffer: Online Diffusion-based Speech Enhancement with Sub-Second Latency
Audio and Speech Processing
Cleans up noisy audio instantly for calls.
The Diffusion Duality
Machine Learning (CS)
Makes computers write stories much faster.
Generative modelling with jump-diffusions
Machine Learning (CS)
Makes AI create more realistic pictures and sounds.