S2ST-Omni: An Efficient Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Progressive Fine-tuning
By: Yu Pan, Yuguang Yang, Yanni Hu, and more
Potential Business Impact:
Translates spoken words between languages instantly.
Despite recent advances in multilingual speech-to-speech translation (S2ST), two critical challenges persist: 1) achieving high-quality translation remains a major hurdle, and 2) most existing methods rely heavily on large-scale parallel speech corpora, which are costly and difficult to obtain. To address these issues, we propose S2ST-Omni, an efficient and scalable framework for multilingual S2ST. Specifically, we decompose the S2ST task into speech-to-text translation (S2TT) and text-to-speech synthesis (TTS). For S2TT, we propose an effective speech language model that integrates the pretrained Whisper encoder for robust audio understanding and Qwen 3.0 for advanced text comprehension. A lightweight speech adapter bridges the modality gap between speech and text representations. To further facilitate multimodal knowledge learning, a two-stage fine-tuning strategy is introduced. In the TTS stage, we adopt a streaming autoregressive generation approach to produce natural and fluent target speech. Experiments on the CVSS benchmark show that S2ST-Omni consistently outperforms existing state-of-the-art S2ST systems in translation quality, highlighting its effectiveness and superiority.
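For intuition, below is a minimal sketch of the speech-adapter idea the abstract describes: Whisper encoder states are downsampled and projected into the language model's embedding space so the text backbone can attend over speech much like prompt tokens. The class name, dimensions, and frame-stacking scheme are illustrative assumptions for this sketch, not the authors' released code.

import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Hypothetical lightweight adapter: maps Whisper encoder states
    into the LLM embedding space (dims are assumptions, not the paper's)."""

    def __init__(self, whisper_dim=1280, llm_dim=4096, stack=4):
        super().__init__()
        self.stack = stack  # concatenate `stack` adjacent frames to shorten the sequence
        self.proj = nn.Sequential(
            nn.Linear(whisper_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats):
        # speech_feats: (batch, frames, whisper_dim)
        b, t, d = speech_feats.shape
        t = t - t % self.stack  # drop any ragged tail frames
        x = speech_feats[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(x)  # (batch, frames // stack, llm_dim)

# Usage sketch: adapted speech embeddings would be prepended to the text
# prompt embeddings before the (frozen or fine-tuned) LLM backbone.
adapter = SpeechAdapter()
speech_feats = torch.randn(1, 1500, 1280)  # e.g., Whisper encoder output
speech_embeds = adapter(speech_feats)      # shape: (1, 375, 4096)

A two-stage recipe like the one described would typically first train only such an adapter for cross-modal alignment, then unfreeze more of the model for end-to-end S2TT fine-tuning; the exact schedule here is the paper's, not shown in this sketch.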
Similar Papers
OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion
Computation and Language
Translates speech and images to text faster.
UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice
Sound
Translates voices, keeping the original emotion and sound.
RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data
Audio and Speech Processing
Translates speech directly without needing speech pairs.