From Signal to Turn: Interactional Friction in Modular Speech-to-Speech Pipelines

Published: December 12, 2025 | arXiv ID: 2512.11724v1

By: Titaya Mairittha, Tanakon Sawanglok, Panuwit Raden and more

Potential Business Impact:

Makes talking robots sound more natural.

Business Areas:

Speech Recognition, Data and Analytics, Software

While voice-based AI systems have achieved remarkable generative capabilities, their interactions often feel conversationally broken. This paper examines the interactional friction that emerges in modular Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG) pipelines. By analyzing a representative production system, we move beyond simple latency metrics to identify three recurring patterns of conversational breakdown: (1) Temporal Misalignment, where system delays violate user expectations of conversational rhythm; (2) Expressive Flattening, where the loss of paralinguistic cues leads to literal, inappropriate responses; and (3) Repair Rigidity, where architectural gating prevents users from correcting errors in real time. Through system-level analysis, we demonstrate that these friction points should not be understood as defects or failures, but as structural consequences of a modular design that prioritizes control over fluidity. We conclude that building natural spoken AI is an infrastructure design challenge, requiring a shift from optimizing isolated components to carefully choreographing the seams between them.
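
To make the Temporal Misalignment pattern concrete, here is a minimal sketch of how delay can accumulate in a strictly gated modular pipeline, where each stage waits for the previous one to finish. It is not taken from the paper: the stage names, the per-stage latencies, and the expected turn gap are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: int  # hypothetical figure, not a measurement from the paper


def time_to_first_audio_ms(stages):
    """Sum stage latencies: in a fully gated pipeline, delays add up."""
    return sum(stage.latency_ms for stage in stages)


# Hypothetical modular S2S-RAG pipeline: speech in, retrieval, generation, speech out.
PIPELINE = [
    Stage("asr", 300),
    Stage("retrieval", 200),
    Stage("llm", 600),
    Stage("tts", 250),
]

# Assumed gap a user tolerates before a reply feels late (illustrative only).
EXPECTED_GAP_MS = 500

total_ms = time_to_first_audio_ms(PIPELINE)
overshoot_ms = total_ms - EXPECTED_GAP_MS
print(f"time to first audio: {total_ms} ms, overshoot: {overshoot_ms} ms")
```

Under these assumed numbers, no single stage exceeds the expected gap by much, yet the sum does. That is one way to read the abstract's claim that the friction arises at the seams between components and not inside any one of them.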

Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage

Computation and Language

AI answers questions faster by guessing what you'll ask.

2 Oct 2025 1

89%

Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation

Computation and Language

Lets talking computers use outside information.

27 Apr 2025 2

87%

Say It, See It: A Systematic Evaluation on Speech-Based 3D Content Generation Methods in Augmented Reality

Human-Computer Interaction

Creates 3D objects from words and pictures.

17 Aug 2025 1

Page Count

6 pages
