Score: 2

DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching

Published: August 8, 2025 | arXiv ID: 2508.05978v1

By: Wei Chen, Binzhu Sha, Dan Luo and more

Potential Business Impact:

Changes one person's singing voice to another's.

Singing Voice Conversion (SVC) converts a source singer's voice into a target singer's timbre while preserving melody and lyrics. The key challenge in any-to-any SVC is applying the timbre of an unseen target singer to the source audio without degrading quality. Existing methods either suffer from timbre leakage or fall short on timbre similarity and audio quality. To address these challenges, we propose DAFMSVC, which replaces the self-supervised learning (SSL) features of the source audio with the most similar SSL features from the target audio to prevent timbre leakage. It also incorporates a dual cross-attention mechanism that adaptively fuses speaker embeddings, melody, and linguistic content. Additionally, we introduce a flow matching module that generates high-quality audio from the fused features. Experimental results show that DAFMSVC significantly improves timbre similarity and naturalness, outperforming state-of-the-art methods in both subjective and objective evaluations.
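
To make the feature-replacement step in the abstract concrete, here is a minimal sketch of swapping each source frame for its most similar target frame. It is an illustration under stated assumptions, not the authors' implementation: the abstract only says "most similar", so cosine similarity, single-nearest-neighbour selection, the toy 2-dimensional "SSL features", and the names `cosine_similarity` and `replace_with_nearest` are all hypothetical choices made here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero frame as dissimilar to everything
    return dot / (norm_a * norm_b)


def replace_with_nearest(source_frames, target_frames):
    """Replace every source frame with the most similar target frame.

    Assumption: similarity is cosine similarity and only the single best
    match is kept. The output contains target-singer features only, which
    is the intuition behind preventing timbre leakage from the source.
    """
    converted = []
    for src in source_frames:
        best = max(target_frames, key=lambda tgt: cosine_similarity(src, tgt))
        converted.append(list(best))
    return converted


# Toy stand-ins for frame-level SSL features (real ones are high-dimensional).
source = [[1.0, 0.1], [0.0, 1.0]]
target = [[0.9, 0.0], [0.1, 2.0], [-1.0, 0.0]]
converted = replace_with_nearest(source, target)
```

The converted sequence keeps the source's frame order and length, so timing carried by the source is preserved, while every feature vector comes from the target audio. In the described system, these features would then be fused with speaker embeddings and melody before audio generation.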

Country of Origin
🇨🇳 China


Page Count
5 pages

Category
Computer Science: Sound