RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data
By: Zhisheng Zheng, Xiaohang Sun, Tuan Dinh and more
Potential Business Impact:
Translates speech directly without needing speech pairs.
The scarcity of parallel speech corpora critically hampers speech-to-speech translation (S2ST), often forcing reliance on complex, multi-stage pipelines. This paper introduces RosettaSpeech, a novel and simplified framework for zero-shot S2ST that is trained on monolingual speech-text data augmented by machine translation supervision. While our method leverages the linguistic knowledge inherent in text-based NMT models, it strictly eliminates the need for parallel speech-to-speech pairs. Our model uniquely uses text as an intermediate bridge during training but functions as a direct, end-to-end speech-to-speech model at inference. This streamlined approach achieves state-of-the-art results on standard benchmarks. For instance, on the CVSS-C test set, RosettaSpeech outperforms leading systems, achieving an ASR-BLEU score of 25.17 for German-to-English and 29.86 for Spanish-to-English, relative gains of over 27% and 14%, respectively. Furthermore, we demonstrate that a single model can deliver strong many-to-one translation performance (FR/ES/DE -> EN). We also provide a foundational analysis of how training data scaling impacts model performance. By prioritizing reliance on abundant parallel text rather than difficult-to-acquire parallel speech, RosettaSpeech offers a scalable path to creating high-quality, speaker-preserving S2ST for a much broader array of languages.
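The relative gains quoted in the abstract can be unpacked with a little arithmetic. The sketch below is illustrative only: the abstract does not state the prior best scores, so the code works backwards from the reported ASR-BLEU values and the stated minimum gains to the highest baseline consistent with them. The function names and the assumption that "relative gain" means (new - baseline) / baseline are ours, not the paper's.

```python
def relative_gain(new_score: float, baseline: float) -> float:
    """Relative gain of new_score over baseline, in percent."""
    return (new_score - baseline) / baseline * 100.0


def baseline_ceiling(new_score: float, min_gain_pct: float) -> float:
    """Largest baseline consistent with a relative gain of at least min_gain_pct."""
    return new_score / (1.0 + min_gain_pct / 100.0)


# (reported ASR-BLEU, stated minimum relative gain in percent), from the abstract.
reported = {"de-en": (25.17, 27.0), "es-en": (29.86, 14.0)}

# Implied upper bounds on the prior best score (an inference, not a reported number).
ceilings = {
    pair: baseline_ceiling(score, gain) for pair, (score, gain) in reported.items()
}

for pair, ceiling in ceilings.items():
    print(f"{pair}: prior best ASR-BLEU below about {ceiling:.2f}")
```

Under that reading, the compared systems would have scored below roughly 19.82 (German-to-English) and 26.19 (Spanish-to-English). The exact baselines are given in the paper itself, not in this summary.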
Similar Papers
Direct Speech to Speech Translation: A Review
Computation and Language
Translates spoken words instantly between languages.
S2ST-Omni: An Efficient Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Progressive Fine-tuning
Audio and Speech Processing
Translates spoken words between languages instantly.
Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data
Computation and Language
Translates Persian speech to English speech directly.