Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis
By: Marc-André Carbonneau, Benjamin van Niekerk, Hugo Seuté and more
Potential Business Impact:
Helps measure how closely cloned voices match the real speakers they imitate.
Modeling voice identity is challenging due to its multifaceted nature. In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings, designed for discrimination rather than characterizing identity. This paper investigates which aspects of a voice are captured in such representations. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. We also identify confounding factors that compromise speaker similarity measurements and suggest mitigation strategies. To address these gaps, we propose U3D, a metric that evaluates speakers' dynamic rhythm patterns. This work contributes to the ongoing challenge of assessing speaker identity consistency in the context of ever-better voice cloning systems. We publicly release our code.
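To make the measurement discussed in the abstract concrete, here is a minimal sketch of how speaker similarity is commonly scored from two ASV embeddings: by their cosine similarity. The abstract does not state which scoring function or embedding model the authors use, so both are assumptions here. The vectors are random placeholders rather than real ASV embeddings, and the 192-dimensional size and the names `reference`, `cloned`, and `unrelated` are hypothetical.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return the cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("embeddings must be non-zero")
    return float(np.dot(a, b) / denom)


# Placeholder vectors standing in for ASV embeddings (assumption: a real
# pipeline would obtain these from a speaker verification model).
rng = np.random.default_rng(0)
reference = rng.standard_normal(192)                  # target speaker
cloned = reference + 0.1 * rng.standard_normal(192)   # slightly perturbed copy
unrelated = rng.standard_normal(192)                  # independent vector

score_same = cosine_similarity(reference, cloned)
score_diff = cosine_similarity(reference, unrelated)
print(f"reference vs cloned:    {score_same:.3f}")
print(f"reference vs unrelated: {score_diff:.3f}")
```

Per the abstract's findings, a high score of this kind would mostly reflect static traits such as timbre and pitch range, and would say little about whether the two utterances share the same rhythm.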
Similar Papers
You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks
Audio and Speech Processing
Makes it harder to hide who is talking.
Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems
Sound
Identifies people by how they talk.
Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR
Audio and Speech Processing
Helps computers understand speech from people with speech problems.