Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture
By: Biao Fu, Donglei Yu, Minpeng Liao, and more
Potential Business Impact:
Translates talking instantly, faster and smarter.
Simultaneous speech translation (SimulST) produces translations incrementally while processing partial speech input. Although large language models (LLMs) have showcased strong capabilities in offline translation tasks, applying them to SimulST poses notable challenges. Existing LLM-based SimulST approaches either incur significant computational overhead due to repeated encoding by a bidirectional speech encoder, or depend on a fixed read/write policy, limiting efficiency and performance. In this work, we introduce Efficient and Adaptive Simultaneous Speech Translation (EASiST) with a fully unidirectional architecture, comprising both the speech encoder and the LLM. EASiST includes a multi-latency data curation strategy to generate semantically aligned SimulST training samples and redefines SimulST as an interleaved generation task with explicit read/write tokens. To facilitate adaptive inference, we incorporate a lightweight policy head that dynamically predicts read/write actions. Additionally, we employ a multi-stage training strategy to align speech-text modalities and optimize both translation and policy behavior. Experiments on the MuST-C En$\rightarrow$De and En$\rightarrow$Es datasets demonstrate that EASiST offers superior latency-quality trade-offs compared to several strong baselines.
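The interleaved read/write formulation in the abstract can be sketched as a simple decoding loop: a policy decides, at each step, whether to consume another speech chunk (read) or emit a translation token (write). The sketch below is a toy illustration only; `policy_head` and `translate_step` are hypothetical stand-ins for the paper's learned policy head and LLM decoder, and the heuristic policy is a wait-k-style placeholder, not EASiST's actual learned behavior.

```python
# Toy sketch of adaptive read/write decoding for SimulST.
# All names here (policy_head, translate_step, simulst) are illustrative
# stand-ins, not the paper's actual implementation.

READ, WRITE = "<read>", "<write>"

def policy_head(n_read, n_written):
    """Hypothetical stand-in for the learned policy head: WRITE once at
    least two more chunks have been read than tokens written."""
    return WRITE if n_read - n_written >= 2 else READ

def translate_step(chunks_read, n_written):
    """Hypothetical stand-in for the LLM decoder: emits a placeholder
    token conditioned (in principle) on the speech read so far."""
    return f"tok{n_written}"

def simulst(speech_chunks):
    """Interleave read/write actions over a stream of speech chunks.
    Toy assumption: the output has one token per input chunk."""
    actions, output, n_read = [], [], 0
    target_len = len(speech_chunks)
    while len(output) < target_len:
        act = policy_head(n_read, len(output))
        if act == READ and n_read < len(speech_chunks):
            n_read += 1          # consume one more speech chunk
            actions.append(READ)
        else:                    # write (also forced once input is exhausted)
            output.append(translate_step(speech_chunks[:n_read], len(output)))
            actions.append(WRITE)
    return actions, output

actions, output = simulst(["c0", "c1", "c2", "c3"])
print(actions)  # interleaved <read>/<write> sequence
print(output)
```

The point of the unidirectional architecture is that each such step only appends to the encoder and LLM caches, so earlier speech never needs re-encoding as new chunks arrive.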
Similar Papers
SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation
Computation and Language
Translates talking instantly, like a real-time interpreter.
Joint Training And Decoding for Multilingual End-to-End Simultaneous Speech Translation
Computation and Language
Translates many languages spoken at once.
Direct Simultaneous Translation Activation for Large Audio-Language Models
Sound
Translates talking instantly, even mid-sentence.