Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views
By: Anna Deichler, Jonas Beskow
Potential Business Impact:
Helps robots understand where to look and what to say.
We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.
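The summary above does not specify how the release is packaged, so the sketch below is only a hypothetical illustration of how the per-session recordings and the 2,707 annotated referential expressions might be paired programmatically. All file names, directory layout, and JSON fields (e.g. `annotations.json`, `transcript`, `target_object`) are assumptions, not the dataset's actual schema.

```python
"""Minimal sketch of iterating referential-expression annotations.

The actual Look and Tell release format is not described in this summary;
the JSON layout, file names, and field names below are hypothetical and
only illustrate how synchronized ego/exo video, gaze, and speech segments
might be paired with annotated referring expressions.
"""
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass
class ReferringExpression:
    session_id: str      # one of the 25 participant sessions
    start_s: float       # utterance start time (seconds, shared clock)
    end_s: float         # utterance end time
    transcript: str      # spoken referring expression, e.g. "the red bowl"
    target_object: str   # annotated referent (ingredient) label
    ego_video: Path      # egocentric (Aria glasses) video for this session
    exo_video: Path      # exocentric (stationary camera) video
    gaze_track: Path     # synchronized gaze samples


def load_expressions(root: Path) -> Iterator[ReferringExpression]:
    """Yield annotated expressions from a hypothetical annotations.json per session."""
    for session_dir in sorted(root.glob("session_*")):
        meta = json.loads((session_dir / "annotations.json").read_text())
        for item in meta["referring_expressions"]:
            yield ReferringExpression(
                session_id=session_dir.name,
                start_s=item["start_s"],
                end_s=item["end_s"],
                transcript=item["transcript"],
                target_object=item["target_object"],
                ego_video=session_dir / "ego.mp4",
                exo_video=session_dir / "exo.mp4",
                gaze_track=session_dir / "gaze.csv",
            )


if __name__ == "__main__":
    # Example usage: print each expression with its annotated referent.
    for expr in load_expressions(Path("look_and_tell")):
        print(f"[{expr.session_id}] {expr.transcript!r} -> {expr.target_object}")
```

A loader along these lines would make it straightforward to compare grounding models that consume the egocentric view, the exocentric view, or both, since each expression is already linked to the synchronized streams for its session.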
Similar Papers
The CASTLE 2024 Dataset: Advancing the Art of Multimodal Understanding
Multimedia
Shows how things look from many viewpoints.
Visual Grounding from Event Cameras
CV and Pattern Recognition
Lets computers understand spoken words about moving things.