VCTR: A Transformer-Based Model for Non-parallel Voice Conversion
By: Maharnab Saikia
Potential Business Impact:
Changes voices without needing matching recordings.
Non-parallel voice conversion aims to convert voice from a source domain to a target domain without paired training data. Cycle-Consistent Generative Adversarial Networks (CycleGANs) and Variational Autoencoders (VAEs) have been applied to this task, but they are difficult to train and yield unsatisfactory results. Contrastive Voice Conversion (CVC) was later introduced, using a contrastive learning-based approach to address these issues. However, these methods rely on CNN-based generators, which capture local semantics but lack the ability to model the long-range dependencies needed for global semantics. In this paper, we propose VCTR, an efficient method for non-parallel voice conversion that leverages the Hybrid Perception Block (HPB) and Dual Pruned Self-Attention (DPSA) along with a contrastive learning-based adversarial approach. The code can be found at https://github.com/Maharnab-Saikia/VCTR.
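The abstract does not spell out how Dual Pruned Self-Attention works, so the following is only a minimal PyTorch sketch of one plausible pruned-attention variant: each query attends to just its top-k highest-scoring keys, which cuts the cost of full self-attention while retaining long-range connections. The class name PrunedSelfAttention, the parameter k, and the top-k masking scheme are illustrative assumptions, not the paper's exact DPSA design.

```python
# Hypothetical sketch of a pruned self-attention block; the real
# DPSA in VCTR may prune differently. This only illustrates the
# general idea of dropping low-scoring keys per query.
import torch
import torch.nn as nn

class PrunedSelfAttention(nn.Module):
    def __init__(self, dim: int, k: int = 16):
        super().__init__()
        self.k = k  # keys kept per query (assumed hyperparameter)
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, dim), e.g. mel-spectrogram frames
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) * self.scale  # (B, N, N)
        # Keep only the top-k scores per query; mask the rest so
        # softmax assigns them zero weight.
        topk = min(self.k, scores.size(-1))
        thresh = scores.topk(topk, dim=-1).values[..., -1:]
        scores = scores.masked_fill(scores < thresh, float("-inf"))
        attn = scores.softmax(dim=-1)
        return self.proj(attn @ v)

if __name__ == "__main__":
    block = PrunedSelfAttention(dim=64, k=8)
    mel = torch.randn(2, 128, 64)  # batch of 2, 128 frames
    print(block(mel).shape)        # torch.Size([2, 128, 64])
```

Because each query keeps only k attention weights, the softmax concentrates on the most relevant frames, which is one common way transformer generators balance global context against the quadratic cost that CNN-based generators avoid.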
Similar Papers
Pureformer-VC: Non-parallel Voice Conversion with Pure Stylized Transformer Blocks and Triplet Discriminative Training
Sound
Changes voices to sound like anyone else.
Collective Learning Mechanism based Optimal Transport Generative Adversarial Network for Non-parallel Voice Conversion
Sound
Makes computer voices sound more real.
Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements
Sound
Changes one voice to sound like another.