FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training
By: Yiqun Yao, Xiang Li, Xin Jiang and more
Potential Business Impact:
Lets chatbots talk and listen at once.
Full-duplex dialog models aim to listen and speak simultaneously, delivering rapid responses to dynamic user input. Among the different approaches to full duplexity, the native approach merges multiple channels at each time step and achieves the lowest latency. However, prevailing designs break the textual monologue into individual words so that each word can be aligned with the audio stream, which degrades language modeling ability. To help address this issue, we introduce natural monologues, which are composed of continuous sentences and waiting intervals, mimicking human-like cognitive behavior in dialogs. We find that a proper training paradigm is critical for semantically aligning natural monologues with audio. To this end, we develop a dual training paradigm that alternates the position of the monologues, either leading or trailing the audio, across different training stages. We apply the combination of natural monologues and dual training in developing FLM-Audio, our 7B spoken dialog chatbot with native full duplexity. As confirmed by experimental results, FLM-Audio achieves superior response quality and chat experience while requiring significantly less training data.
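To make the contrast in the abstract concrete, the following is a minimal sketch of how a text channel might be laid out against audio frames under word-level alignment versus a natural monologue. It is an illustration under stated assumptions, not the authors' implementation: the `<wait>` token, the function names, the one-token-per-frame layout, and the reading of "leading" and "trailing" as placing the sentence at the start or end of the segment are all hypothetical.

```python
# Hypothetical sketch: token name, function names, and frame layout are
# assumptions made for illustration, not details taken from FLM-Audio.
WAIT = "<wait>"  # assumed filler token for frames where no text is emitted


def word_level_channel(words, word_start_frames, n_frames):
    """Word-level alignment: each word sits at the frame where its audio
    starts, so the sentence is broken up by filler tokens."""
    channel = [WAIT] * n_frames
    for word, frame in zip(words, word_start_frames):
        channel[frame] = word
    return channel


def natural_monologue_channel(words, n_frames, lead=True):
    """Natural monologue: the sentence stays contiguous and the remaining
    frames become one waiting interval. With lead=True the text comes
    before the waiting interval; with lead=False it comes after it."""
    if len(words) > n_frames:
        raise ValueError("sentence does not fit in the available frames")
    waiting = [WAIT] * (n_frames - len(words))
    return list(words) + waiting if lead else waiting + list(words)


words = ["hello", "there"]
aligned = word_level_channel(words, word_start_frames=[0, 3], n_frames=6)
leading = natural_monologue_channel(words, n_frames=6, lead=True)
trailing = natural_monologue_channel(words, n_frames=6, lead=False)
print(aligned)
print(leading)
print(trailing)
```

In this toy layout the word-level channel interleaves words with filler tokens, while both natural-monologue variants keep the sentence intact. A dual training schedule in the sense of the abstract would use the leading layout in some stages and the trailing layout in others; how the stages are actually ordered and weighted is not specified here.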
Similar Papers
FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training
Sound
Lets computers talk and listen at once.
DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching
Audio and Speech Processing
Makes computer voices have real conversations.
From Turn-Taking to Synchronous Dialogue: A Survey of Full-Duplex Spoken Language Models
Computation and Language
Lets AI talk and listen at the same time.