StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation
By: Xi Chen, Yuchen Song, Satoshi Nakamura
Potential Business Impact:
Translates speech while preserving which words the speaker emphasizes.
We propose a stress-aware speech-to-speech translation (S2ST) system that preserves word-level emphasis by leveraging LLMs for cross-lingual emphasis conversion. Our method translates source-language stress into target-language tags that guide a controllable TTS model. To overcome data scarcity, we develop a pipeline that automatically generates aligned training data, and we introduce an "LLM-as-Judge" approach for evaluation. Experiments show that our approach substantially outperforms baselines in preserving emphasis while maintaining comparable translation quality, speaker intent, and naturalness. Our work highlights the importance of prosody in translation and provides an effective, data-efficient solution for preserving paralinguistic cues in S2ST.
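The idea of carrying emphasis as tags in text can be illustrated with a minimal sketch. The abstract does not specify the tag syntax, the prompt, or the TTS interface, so everything below is an assumption for illustration only: the `<em>…</em>` tag format and the helper names `tag_emphasis`, `parse_emphasis`, and `build_prompt` are hypothetical, and no LLM or TTS model is actually called.

```python
import re

# Hypothetical tag format; the paper's actual tag syntax is not given in the abstract.
EM_TAG = re.compile(r"<em>(.+?)</em>")


def tag_emphasis(words, stressed):
    """Wrap the words at the given indices in <em> tags."""
    return " ".join(
        f"<em>{w}</em>" if i in stressed else w for i, w in enumerate(words)
    )


def parse_emphasis(tagged):
    """Return (plain_text, emphasized_word_indices) for a tagged sentence.

    Assumes each tag wraps a single whitespace-delimited word.
    """
    plain_words, stressed = [], []
    for token in tagged.split():
        if EM_TAG.search(token):
            stressed.append(len(plain_words))
        # Strip the tags but keep any attached punctuation.
        plain_words.append(EM_TAG.sub(r"\1", token))
    return " ".join(plain_words), stressed


def build_prompt(tagged_source, src_lang, tgt_lang):
    """Build an illustrative instruction for an LLM (assumed wording)."""
    return (
        f"Translate the following {src_lang} sentence into {tgt_lang}. "
        "Words inside <em></em> are emphasized by the speaker; mark the "
        "corresponding emphasized words in the translation with <em></em>.\n"
        f"Sentence: {tagged_source}"
    )


source = tag_emphasis(["I", "did", "not", "say", "that"], {2})
prompt = build_prompt(source, "English", "Japanese")
plain, indices = parse_emphasis(source)
```

In a setup like this, the tagged translation returned by the LLM would be parsed the same way, and the resulting word indices passed to whatever emphasis control the TTS model exposes.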
Similar Papers
StressTest: Can YOUR Speech LM Handle the Stress?
Computation and Language
Helps computers understand meaning from spoken emphasis.
A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation
Computation and Language
Translates voices with emotions, not just words.
UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice
Sound
Translates voices, keeping the original emotion and sound.