Modeling Turn-Taking with Semantically Informed Gestures
By: Varsha Suresh, M. Hamza Mughal, Christian Theobalt and more
Potential Business Impact:
Helps computers know when it is their turn to speak in a conversation.
In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.
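The abstract describes a Mixture-of-Experts model that fuses text, audio, and gesture features to predict turn-taking. Below is a minimal sketch of that general idea: one expert per modality, a learned gate that weights the experts, and a classifier over turn labels. All module names, feature dimensions, and the two-class label set are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: gated Mixture-of-Experts over text/audio/gesture features
# for turn-taking prediction. Dimensions and label set are assumptions.
import torch
import torch.nn as nn


class ModalityExpert(nn.Module):
    """One expert per modality: projects its features into a shared space."""

    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class TurnTakingMoE(nn.Module):
    """Gated mixture of text, audio, and gesture experts with a turn classifier."""

    def __init__(self, text_dim=768, audio_dim=512, gesture_dim=256,
                 hidden_dim=256, num_classes=2):
        super().__init__()
        self.experts = nn.ModuleDict({
            "text": ModalityExpert(text_dim, hidden_dim),
            "audio": ModalityExpert(audio_dim, hidden_dim),
            "gesture": ModalityExpert(gesture_dim, hidden_dim),
        })
        # The gate sees the concatenated raw features and weights each expert.
        self.gate = nn.Linear(text_dim + audio_dim + gesture_dim, 3)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, text, audio, gesture):
        gate_logits = self.gate(torch.cat([text, audio, gesture], dim=-1))
        weights = torch.softmax(gate_logits, dim=-1)          # (batch, 3)
        expert_outs = torch.stack([
            self.experts["text"](text),
            self.experts["audio"](audio),
            self.experts["gesture"](gesture),
        ], dim=1)                                             # (batch, 3, hidden_dim)
        fused = (weights.unsqueeze(-1) * expert_outs).sum(dim=1)
        return self.classifier(fused)                         # logits over turn labels


if __name__ == "__main__":
    model = TurnTakingMoE()
    logits = model(torch.randn(4, 768), torch.randn(4, 512), torch.randn(4, 256))
    print(logits.shape)  # torch.Size([4, 2]), e.g. turn hold vs. turn shift
```

In a setup like this, the semantic gesture annotations from DnD Gesture++ would feed the gesture expert, and the gate lets the model lean on whichever modality is most informative at a given moment.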
Similar Papers
Multimodal Transformer Models for Turn-taking Prediction: Effects on Conversational Dynamics of Human-Agent Interaction during Cooperative Gameplay
Human-Computer Interaction
Helps game characters know when to talk.
Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues
Computation and Language
Helps computers understand talking by watching hand movements.
Predicting Turn-Taking and Backchannel in Human-Machine Conversations Using Linguistic, Acoustic, and Visual Signals
Computation and Language
Helps computers understand when to talk in conversations.