X-Talk: On the Underestimated Potential of Modular Speech-to-Speech Dialogue System
By: Zhanxun Liu, Yifan Duan, Mengmeng Wang, and more
We present X-Talk, an open-source framework that champions a decoupled, modular design for LLM-driven speech-to-speech (S2S) systems. While the dominant trend favors end-to-end (E2E) modeling to optimize information flow, these "omni-models" often struggle to balance the competing objectives of complex speech tasks within a single network. X-Talk challenges this paradigm by demonstrating that a systematically optimized cascaded pipeline can achieve sub-second latency without sacrificing modular flexibility. Our framework seamlessly integrates specialized front-end components (e.g., VAD, speech enhancement) and diverse understanding models (e.g., ASR, emotion, and environmental sound analysis) with LLM capabilities like retrieval-augmented generation (RAG) and tool use. By revitalizing the cascaded approach, X-Talk highlights the underestimated potential of modular S2S systems and provides a robust foundation for future research and applications.
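To make the decoupled design concrete, here is a minimal, hypothetical sketch of a cascaded S2S turn in Python. The stage names, the `Turn` data structure, and the stub implementations are illustrative assumptions, not the actual X-Talk API; the point is only that each stage (enhancement, recognition and paralinguistic analysis, LLM response, synthesis) is an independently swappable component behind a shared interface.

```python
# Hypothetical sketch of a decoupled, modular S2S pipeline in the spirit of the
# cascaded design described above. Names and interfaces are illustrative only.

from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Turn:
    """Intermediate state handed from one pipeline stage to the next."""
    audio: bytes                                # raw user audio for this turn
    transcript: str = ""                        # filled in by the ASR stage
    cues: dict = field(default_factory=dict)    # e.g. emotion or ambient-sound tags
    reply_text: str = ""                        # filled in by the LLM stage
    reply_audio: bytes = b""                    # filled in by the TTS stage


class Stage(Protocol):
    """Every stage maps a Turn to an updated Turn, so stages compose freely."""
    def __call__(self, turn: Turn) -> Turn: ...


class SpeechEnhancer:
    def __call__(self, turn: Turn) -> Turn:
        # Placeholder: a real front end would denoise / dereverberate turn.audio.
        return turn


class Recognizer:
    def __call__(self, turn: Turn) -> Turn:
        turn.transcript = "<asr output>"        # stub ASR result
        turn.cues["emotion"] = "neutral"        # stub paralinguistic analysis
        return turn


class LLMResponder:
    def __call__(self, turn: Turn) -> Turn:
        # A real implementation could inject RAG context or tool calls here.
        turn.reply_text = f"You said: {turn.transcript} ({turn.cues['emotion']})"
        return turn


class Synthesizer:
    def __call__(self, turn: Turn) -> Turn:
        turn.reply_audio = turn.reply_text.encode()  # stub TTS
        return turn


def run_pipeline(audio: bytes, stages: list[Stage]) -> Turn:
    """Run one dialogue turn through the cascaded stages in order."""
    turn = Turn(audio=audio)
    for stage in stages:
        turn = stage(turn)
    return turn


if __name__ == "__main__":
    pipeline = [SpeechEnhancer(), Recognizer(), LLMResponder(), Synthesizer()]
    result = run_pipeline(b"\x00\x01", pipeline)
    print(result.reply_text)
```

Because every stage shares the same `Turn`-in, `Turn`-out contract, individual components (a different VAD, an emotion classifier, a streaming ASR) can be replaced or added without retraining the rest of the system, which is the flexibility the cascaded approach trades against the E2E paradigm.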
Similar Papers
From Signal to Turn: Interactional Friction in Modular Speech-to-Speech Pipelines
Human-Computer Interaction
Makes talking robots sound more natural.
SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation
Audio and Speech Processing
Makes computers talk like real people, not robots.