Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems
By: Natalia Tomashenko, Emmanuel Vincent, Marc Tommasi
Potential Business Impact:
Identifies people by how they talk.
The temporal dynamics of speech, encompassing variations in rhythm, intonation, and speaking rate, contain important and unique information about speaker identity. This paper proposes a new method for representing speaker characteristics by extracting context-dependent duration embeddings from speech temporal dynamics. We develop novel attack models using these representations and analyze the potential vulnerabilities in speaker verification and voice anonymization systems. The experimental results show that the developed attack models significantly improve speaker verification performance on both original and anonymized data compared with the simpler representations of speech temporal dynamics reported in the literature.