Direct Simultaneous Translation Activation for Large Audio-Language Models

Published: September 19, 2025 | arXiv ID: 2509.15692v1

By: Pei Zhang, Yiming Wang, Jialong Tang and more

Potential Business Impact:

Translates talking instantly, even mid-sentence.

Business Areas:

Translation Service, Professional Services

Simultaneous speech-to-text translation (Simul-S2TT) aims to translate speech into target text in real time, outputting translations while receiving source speech input, rather than waiting for the entire utterance to be spoken. Simul-S2TT research often modifies model architectures to implement read-write strategies. However, with the rise of large audio-language models (LALMs), a key challenge is how to directly activate Simul-S2TT capabilities in base models without additional architectural changes. In this paper, we introduce Simultaneous Self-Augmentation (SimulSA), a strategy that utilizes LALMs' inherent capabilities to obtain simultaneous data by randomly truncating speech and constructing partially aligned translation. By incorporating them into offline SFT data, SimulSA effectively bridges the distribution gap between offline translation during pretraining and simultaneous translation during inference. Experimental results demonstrate that augmenting only about 1% of the simultaneous data, compared to the full offline SFT data, can significantly activate LALMs' Simul-S2TT capabilities without modifications to model architecture or decoding strategy.
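The data-construction idea in the abstract (truncate speech at a random point, pair the prefix with a partial translation, and mix a small share of such pairs into the offline SFT set) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the `Sample` type, the function names, the truncation ratios, and the `translate_prefix` callback are all hypothetical. In particular, `toy_partial_translation` is a crude length-proportional stand-in for the step where the paper uses the LALM itself to obtain the partially aligned translation.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Sample:
    audio: List[float]  # waveform samples (hypothetical representation)
    target: str         # target-language text


def truncate_audio(audio: Sequence[float], rng: random.Random,
                   min_ratio: float = 0.2, max_ratio: float = 0.9) -> List[float]:
    """Keep a random-length prefix of the waveform (ratios are assumptions)."""
    ratio = rng.uniform(min_ratio, max_ratio)
    cut = max(1, int(len(audio) * ratio))
    return list(audio[:cut])


def toy_partial_translation(prefix_audio: Sequence[float], full: Sample) -> str:
    """Toy stand-in: keep target words in proportion to the audio kept.

    The paper obtains partial translations with the LALM; this heuristic
    only exists so the sketch runs on its own.
    """
    words = full.target.split()
    keep = int(len(words) * len(prefix_audio) / len(full.audio))
    return " ".join(words[:keep])


def build_simul_samples(offline: Sequence[Sample],
                        translate_prefix: Callable[[Sequence[float], Sample], str],
                        fraction: float = 0.01,
                        seed: int = 0) -> List[Sample]:
    """Create truncated (speech prefix, partial translation) pairs.

    `fraction` is the size of the augmented set relative to the offline
    set; 0.01 mirrors the "about 1%" figure quoted in the abstract.
    """
    rng = random.Random(seed)
    n = min(len(offline), max(1, round(len(offline) * fraction)))
    chosen = rng.sample(list(offline), n)
    simul = []
    for sample in chosen:
        prefix = truncate_audio(sample.audio, rng)
        simul.append(Sample(audio=prefix, target=translate_prefix(prefix, sample)))
    return simul


def mix_sft_data(offline: Sequence[Sample], simul: Sequence[Sample],
                 seed: int = 0) -> List[Sample]:
    """Append the augmented pairs to the offline data and shuffle."""
    rng = random.Random(seed)
    mixed = list(offline) + list(simul)
    rng.shuffle(mixed)
    return mixed
```

A call such as `mix_sft_data(offline, build_simul_samples(offline, toy_partial_translation))` would then yield a training set that is mostly full-utterance pairs with a small share of truncated ones. Because the model architecture and decoding strategy stay untouched, all of the change lives in this data-preparation step.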

SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation

Computation and Language

Translates talking instantly, like a real-time interpreter.

22 Apr 2025

Language translation, and change of accent for speech-to-speech task using diffusion model

Computation and Language

Translates languages and changes voices at once.

4 May 2025

Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture

Computation and Language

Translates talking instantly, faster and smarter.

16 Apr 2025

Page Count

5 pages
