Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition
By: Jiamin Xie, Ju Lin, Yiteng Huang, and more
Potential Business Impact:
Lets glasses hear who is talking and from where.
Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech recognition. However, the ability of Speech LLMs to comprehend and process multi-channel audio with spatial cues remains largely unexplored. In this work, we present directional-SpeechLlama, a novel approach that leverages the microphone array of smart glasses to achieve directional speech recognition, source localization, and bystander cross-talk suppression. To enhance the model's ability to understand directivity, we propose two key techniques: serialized directional output training (S-DOT) and contrastive direction data augmentation (CDDA). Experimental results show that our proposed directional-SpeechLlama effectively captures the relationship between textual cues and spatial audio, yielding strong performance on both speech recognition and source localization tasks.
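The abstract names serialized directional output training (S-DOT) but does not spell out the target format, so here is a minimal Python sketch of what serializing "who said what, from where" into a single text target could look like. The `<dir=...>` tag syntax, the 15-degree angle binning, and the `TalkerSegment` fields are all illustrative assumptions, not the paper's actual scheme.

```python
# Hypothetical sketch of an S-DOT-style target string. The tag format,
# angle bins, and field names are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class TalkerSegment:
    start: float    # segment onset in seconds
    angle_deg: int  # direction of arrival relative to the glasses, degrees
    text: str       # transcript of this talker's speech


def serialize_directional_output(segments: list[TalkerSegment],
                                 bin_width: int = 15) -> str:
    """Serialize multi-talker transcripts into one training target.

    Segments are ordered by onset time, and each is prefixed with a
    quantized direction tag so the LLM learns to emit direction and
    transcript jointly as ordinary text.
    """
    parts = []
    for seg in sorted(segments, key=lambda s: s.start):
        # Quantize direction into coarse bins (assumed 15-degree steps)
        # to keep the direction-tag vocabulary small.
        binned = round(seg.angle_deg / bin_width) * bin_width
        parts.append(f"<dir={binned:+d}> {seg.text.strip()}")
    return " ".join(parts)


if __name__ == "__main__":
    segments = [
        TalkerSegment(start=1.2, angle_deg=-47, text="see you tomorrow"),
        TalkerSegment(start=0.3, angle_deg=32, text="turn left at the corner"),
    ]
    # -> "<dir=+30> turn left at the corner <dir=-45> see you tomorrow"
    print(serialize_directional_output(segments))
```

A serialization like this would also make bystander suppression expressible in text: segments whose direction falls outside a wearer-facing cone could simply be omitted from the target string.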
Similar Papers
Spatial Audio Processing with Large Language Model on Wearable Devices
Sound
Listens to where sounds come from.
Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing
Audio and Speech Processing
Makes computers understand spoken words better.
DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching
Audio and Speech Processing
Makes computer voices have real conversations.