VCTR: A Transformer-Based Model for Non-parallel Voice Conversion

Published: October 14, 2025 | arXiv ID: 2510.12964v1

By: Maharnab Saikia

Potential Business Impact:

Changes voices without needing matching recordings.

Business Areas:
Speech Recognition Data and Analytics, Software

Non-parallel voice conversion aims to convert a voice from a source domain to a target domain without paired training data. Cycle-Consistent Generative Adversarial Networks (CycleGAN) and Variational Autoencoders (VAE) have been used for this task, but these models suffer from unstable training and unsatisfactory results. Later, Contrastive Voice Conversion (CVC) was introduced, using a contrastive learning-based approach to address these issues. However, these methods rely on CNN-based generators, which can capture local semantics but lack the ability to model the long-range dependencies needed for global semantics. In this paper, we propose VCTR, an efficient method for non-parallel voice conversion that leverages the Hybrid Perception Block (HPB) and Dual Pruned Self-Attention (DPSA) along with a contrastive learning-based adversarial approach. The code can be found at https://github.com/Maharnab-Saikia/VCTR.
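The locality argument above can be made concrete with a toy experiment. The sketch below (illustrative only; it is not the paper's DPSA or HPB, and all weights and sizes are made up) compares plain scaled dot-product self-attention against a 3-tap 1-D convolution: perturbing a single input frame changes the attention output at every time step, while the convolution output changes only inside the local window.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over frames (T, d).
    Every output frame can attend to every input frame (global receptive field)."""
    d = x.shape[-1]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)                       # (T, T): all-pairs interactions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                  # row-wise softmax
    return w @ v

def conv1d_same(x, kernel):
    """Depthwise 1-D convolution with 'same' padding.
    Each output frame sees only a local window of the input."""
    k, pad = len(kernel), len(kernel) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([(xp[t:t + k] * kernel[:, None]).sum(axis=0)
                     for t in range(x.shape[0])])

rng = np.random.default_rng(0)
T, d = 8, 4
x = rng.standard_normal((T, d))                         # 8 frames, 4 channels
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
kern = np.ones(3) / 3                                   # 3-tap averaging kernel

# Perturb only the first frame.
x2 = x.copy()
x2[0] += 1.0

d_attn = np.abs(self_attention(x2, Wq, Wk, Wv) - self_attention(x, Wq, Wk, Wv)).max(axis=1)
d_conv = np.abs(conv1d_same(x2, kern) - conv1d_same(x, kern)).max(axis=1)

print((d_attn > 1e-9).all())    # attention: the perturbation reaches every frame
print((d_conv[2:] < 1e-12).all())  # conv: frames outside the 3-tap window are untouched
```

This is the receptive-field gap the abstract refers to: a CNN generator needs many stacked layers before a frame can influence distant frames, whereas a single attention layer already mixes information across the whole utterance.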

Repos / Data Links
https://github.com/Maharnab-Saikia/VCTR

Page Count
7 pages

Category
Computer Science: Sound