I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue
By: Esam Ghaleb, Bulat Khaertdinov, Aslı Özyürek, and more
Potential Business Impact:
Helps computers understand what you are pointing at while you talk.
In face-to-face interaction, we use multiple modalities, including speech and gestures, to communicate information and resolve references to objects. However, how representational co-speech gestures refer to objects remains understudied from a computational perspective. In this work, we address this gap by introducing a multimodal reference resolution task centred on representational gestures, while simultaneously tackling the challenge of learning robust gesture embeddings. We propose a self-supervised pre-training approach to gesture representation learning that grounds body movements in spoken language. Our experiments show that the learned embeddings align with expert annotations and have significant predictive power. Moreover, reference resolution accuracy further improves when (1) using multimodal gesture representations, even when speech is unavailable at inference time, and (2) leveraging dialogue history. Overall, our findings highlight the complementary roles of gesture and speech in reference resolution, offering a step towards more naturalistic models of human-machine interaction.
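The abstract does not spell out the pre-training objective beyond saying that body movements are grounded in spoken language. One common way to realise that kind of grounding is a symmetric contrastive (InfoNCE/CLIP-style) loss that aligns gesture and speech embeddings from the same moment in a dialogue. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's actual method; the encoder architectures, feature dimensions, and temperature are all assumptions.

```python
# Hypothetical sketch: contrastive gesture-speech pre-training (assumed, not taken from the paper).
# Two encoders map paired gesture and speech clips into a shared embedding space;
# a symmetric InfoNCE loss pulls matched pairs together and pushes mismatched pairs apart.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GestureSpeechAligner(nn.Module):
    def __init__(self, gesture_dim=150, speech_dim=768, embed_dim=256, temperature=0.07):
        super().__init__()
        # Placeholder encoders: a real system might use a skeleton-based
        # spatio-temporal network for gestures and a pretrained speech model.
        self.gesture_encoder = nn.Sequential(
            nn.Linear(gesture_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )
        self.speech_encoder = nn.Sequential(
            nn.Linear(speech_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )
        self.temperature = temperature

    def forward(self, gesture_feats, speech_feats):
        # L2-normalise both modalities so the dot product is cosine similarity.
        g = F.normalize(self.gesture_encoder(gesture_feats), dim=-1)
        s = F.normalize(self.speech_encoder(speech_feats), dim=-1)
        logits = g @ s.t() / self.temperature  # (batch, batch) similarity matrix
        targets = torch.arange(g.size(0), device=g.device)
        # Symmetric InfoNCE: each gesture should match its own speech segment and vice versa.
        loss = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
        return loss


if __name__ == "__main__":
    model = GestureSpeechAligner()
    gestures = torch.randn(8, 150)  # e.g. pooled keypoint features per gesture clip
    speech = torch.randn(8, 768)    # e.g. pooled speech features for the same clip
    print(float(model(gestures, speech)))
```

Under this kind of setup, the gesture encoder learned during pre-training could then be frozen or fine-tuned and its embeddings fed to a downstream classifier over candidate referents, which would let the system exploit speech-grounded gesture representations even when speech is unavailable at inference time, as the abstract reports.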
Similar Papers
Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures
Computation and Language
Helps computers understand what you mean in chats.
Large Language Models for Virtual Human Gesture Selection
Human-Computer Interaction
Makes virtual characters gesture naturally when they talk.
Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues
Computation and Language
Helps computers understand speech by watching hand movements.