UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice

Published: September 25, 2025 | arXiv ID: 2509.21144v1

By: Sitong Cheng, Weizhen Bian, Xinsheng Wang and more

Potential Business Impact:

Translates voices, keeping the original emotion and sound.

Business Areas:

Translation Service, Professional Services

The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data that retains expressive styles, the complexity of multi-stage processing pipelines, and the limited transfer of translation capabilities from large language models (LLMs). In this work, we address these challenges by introducing UniSS, a novel single-stage framework for expressive S2ST. Our approach features carefully designed speech semantic and style modeling, enabling seamless integration with existing text-based LLM frameworks to develop a unified text-speech language model. To transfer translation capabilities from text to speech, we propose a cross-modal chain-of-thought prompting process that progressively aligns audio semantics with text and ensures style preservation in the decoded results. Furthermore, we construct and release a large-scale, high-quality expressive S2ST dataset, UniST, comprising 44.8k hours of data. Experimental results show that UniSS significantly outperforms previous methods in translation fidelity and speech quality while preserving voice, emotion, and duration consistency. Our work establishes a simpler and more effective paradigm for building the next generation of expressive S2ST systems. Audio samples are available at https://cmots.github.io/uniss-demo.

Direct Speech to Speech Translation: A Review

Computation and Language

Translates spoken words instantly between languages.

3 Mar 2025 0

89%

S2ST-Omni: An Efficient Multilingual Speech-to-Speech Translation Framework via Seamless Speech-Text Alignment and Progressive Fine-tuning

Audio and Speech Processing

Translates spoken words between languages instantly.

11 Jun 2025 1

89%

Language translation, and change of accent for speech-to-speech task using diffusion model

Computation and Language

Translates languages and changes voices at once.

4 May 2025 0

Page Count

22 pages
