Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking
By: Katia Vendrame, Bolaji Yusuf, Santosh Kesiraju, and more
Potential Business Impact:
Lets computers track what users want in spoken conversations, even in new domains that lack spoken training data.
End-to-end spoken dialogue state tracking (DST) is made difficult by the combined challenges of handling speech input and data scarcity. Recent work has proposed combining speech foundation encoders with large language models to alleviate some of this difficulty. Although this approach yields strong spoken DST models, achieving state-of-the-art performance in realistic multi-turn DST, it struggles to generalize across domains and requires annotated spoken DST training data for each domain of interest. Collecting such data for every target domain is both costly and difficult. Noting that textual DST data is more easily obtained for various domains, in this work we propose jointly training on available spoken DST data and written textual data from other domains as a way to achieve cross-domain generalization. Our experiments show the efficacy of the proposed method for achieving strong cross-domain DST performance without relying on spoken training data from the target domains.
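To make the joint training idea concrete, here is a minimal sketch of how spoken and textual DST examples could share one LLM during training: spoken turns pass through a speech encoder that projects audio features into the LLM's embedding space, textual turns use the LLM's own token embeddings, and both optimize the same loss over the dialogue-state token sequence. The module names (SpeechEncoder, TinyLM), dimensions, and sampling scheme are illustrative assumptions, not the paper's actual components or hyperparameters.

```python
# Hedged sketch of joint speech + text DST training (PyTorch).
# Assumes a speech encoder projected into the LLM input space and a
# decoder that consumes input embeddings directly; all sizes are toy values.
import random
import torch
import torch.nn as nn

VOCAB, DIM = 1000, 64

class SpeechEncoder(nn.Module):
    """Stand-in for a speech foundation encoder projected into the LLM space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(80, DIM)          # 80-dim log-mel frames -> LLM dim
    def forward(self, feats):                   # feats: (T, 80)
        return self.proj(feats)                 # (T, DIM)

class TinyLM(nn.Module):
    """Stand-in decoder that accepts input embeddings and predicts tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)
    def forward(self, input_embeds):            # (1, L, DIM)
        return self.head(self.body(input_embeds))  # (1, L, VOCAB)

def dst_loss(lm, prefix_embeds, state_tokens):
    """Teacher-forced loss on the dialogue-state token sequence."""
    tgt = lm.embed(state_tokens)                        # (1, S, DIM)
    inp = torch.cat([prefix_embeds, tgt[:, :-1]], 1)    # shift targets right
    logits = lm(inp)[:, prefix_embeds.size(1) - 1:]     # positions predicting the state
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), state_tokens.view(-1))

encoder, lm = SpeechEncoder(), TinyLM()
opt = torch.optim.AdamW(list(encoder.parameters()) + list(lm.parameters()), lr=1e-4)

# Toy stand-ins for the two corpora: spoken DST (audio features) from one
# domain, textual DST (token ids) from other domains.
spoken = [(torch.randn(50, 80), torch.randint(0, VOCAB, (1, 8))) for _ in range(4)]
textual = [(torch.randint(0, VOCAB, (1, 30)), torch.randint(0, VOCAB, (1, 8)))
           for _ in range(4)]

for step in range(8):
    if random.random() < 0.5:                   # sample a modality each step
        feats, state = random.choice(spoken)
        prefix = encoder(feats).unsqueeze(0)    # speech -> LLM input embeddings
    else:
        tokens, state = random.choice(textual)
        prefix = lm.embed(tokens)               # text -> LLM input embeddings
    loss = dst_loss(lm, prefix, state)
    opt.zero_grad(); loss.backward(); opt.step()
```

The key design point the sketch illustrates is that both modalities share the same decoder and the same DST target format, so supervision from text-only domains can shape the state-generation behavior that spoken inputs also rely on.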
Similar Papers
Interpretable and Robust Dialogue State Tracking via Natural Language Summarization with LLMs
Computation and Language
Helps chatbots understand what you're saying better.
Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models
Computation and Language
Makes computers understand many people talking at once.
The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach
Computation and Language
Lets computers understand long spoken chats better.