TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding

Published: January 11, 2026 | arXiv ID: 2601.06896v1

By: Mingyue Huo, Yiwen Shao, Yuheng Zhang

We present TagSpeech, a unified LLM-based framework that utilizes Temporal Anchor Grounding for joint multi-speaker ASR and diarization. The framework is built on two key designs: (1) decoupled semantic and speaker streams fine-tuned via Serialized Output Training (SOT) to learn turn-taking dynamics; and (2) an interleaved time anchor mechanism that not only supports fine-grained timestamp prediction but also acts as a synchronization signal between semantic understanding and speaker tracking. Compared to previous works that primarily focus on speaker-attributed ASR or implicit diarization, TagSpeech addresses the challenge of fine-grained speaker-content alignment and explicitly models "who spoke what and when" in an end-to-end manner. Experiments on AMI and AliMeeting benchmarks demonstrate that our method achieves consistent improvements in Diarization Error Rate (DER) over strong end-to-end baselines, including Qwen-Omni and Gemini, particularly in handling complex speech overlaps. Moreover, TagSpeech employs a parameter-efficient training paradigm in which the LLM backbone is frozen and only lightweight projectors are trained, resulting in strong performance with low computational cost.
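The abstract reports results in Diarization Error Rate (DER), which in its standard definition is the sum of missed speech, false-alarm speech, and speaker-confusion time divided by the total reference speech time. The following is a minimal frame-level sketch of that metric, not the authors' scoring tool. It assumes that hypothesis speaker labels have already been mapped to reference labels and that no forgiveness collar is applied. The names `Segment`, `frame_labels`, and `diarization_error_rate` are hypothetical and chosen for illustration.

```python
from typing import List, Set, Tuple

# (start_sec, end_sec, speaker_label); a hypothetical segment layout for this sketch.
Segment = Tuple[float, float, str]


def frame_labels(segments: List[Segment], num_frames: int, step: float) -> List[Set[str]]:
    """Convert segments into one set of active speakers per frame."""
    frames: List[Set[str]] = [set() for _ in range(num_frames)]
    for start, end, speaker in segments:
        first = max(int(round(start / step)), 0)
        last = min(int(round(end / step)), num_frames)
        for i in range(first, last):
            frames[i].add(speaker)
    return frames


def diarization_error_rate(reference: List[Segment],
                           hypothesis: List[Segment],
                           step: float = 0.01) -> float:
    """Frame-level DER = (miss + false alarm + confusion) / reference speech.

    Assumes hypothesis labels are already mapped to reference labels.
    Overlapping speech counts once per active reference speaker.
    """
    end_time = max(end for _, end, _ in reference + hypothesis)
    num_frames = int(round(end_time / step))
    ref = frame_labels(reference, num_frames, step)
    hyp = frame_labels(hypothesis, num_frames, step)

    miss = false_alarm = confusion = total = 0
    for r, h in zip(ref, hyp):
        total += len(r)
        miss += max(0, len(r) - len(h))         # reference speakers left uncovered
        false_alarm += max(0, len(h) - len(r))  # extra hypothesized speakers
        confusion += min(len(r), len(h)) - len(r & h)  # wrong speaker assigned
    if total == 0:
        raise ValueError("reference contains no speech")
    return (miss + false_alarm + confusion) / total


# Speakers A and B overlap from 1.0 s to 2.0 s in the reference;
# the hypothesis misses B during the overlap.
reference = [(0.0, 2.0, "A"), (1.0, 3.0, "B")]
hypothesis = [(0.0, 2.0, "A"), (2.0, 3.0, "B")]
print(diarization_error_rate(reference, hypothesis))
```

In this toy case the only error is the missed second speaker during the one-second overlap, against four seconds of reference speech, which is why overlap handling has a direct effect on DER.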

Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization

Computation and Language

Makes computers understand emotions in talking better.

25 Jul 2025 0

89%

TimeAudio: Bridging Temporal Gaps in Large Audio-Language Models

Sound

Helps computers understand exact moments in audio.

14 Nov 2025 1

89%

SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models

Sound

Identifies who spoke what in audio.

8 Aug 2025 1
