
Multi-human Interactive Talking Dataset

Published: August 5, 2025 | arXiv ID: 2508.03050v1

By: Zeyu Zhu, Weijia Wu, Mike Zheng Shou

Potential Business Impact:

Makes videos of many people talking together.

Existing studies on talking video generation have predominantly focused on single-person monologues or isolated facial animations, limiting their applicability to realistic multi-human interactions. To bridge this gap, we introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. We develop an automatic pipeline that collects and annotates multi-person conversational videos. The resulting dataset comprises 12 hours of high-resolution footage, each featuring two to four speakers, with fine-grained annotations of body poses and speech interactions. It captures natural conversational dynamics in multi-speaker scenarios, offering a rich resource for studying interactive visual behaviors. To demonstrate the potential of MIT, we further propose CovOG, a baseline model for this novel task. It integrates a Multi-Human Pose Encoder (MPE) to handle varying numbers of speakers by aggregating individual pose embeddings, and an Interactive Audio Driver (IAD) to modulate head dynamics based on speaker-specific audio features. Together, these components showcase the feasibility and challenges of generating realistic multi-human talking videos, establishing MIT as a valuable benchmark for future research. The code is available at: https://github.com/showlab/Multi-human-Talking-Video-Dataset.
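To illustrate the idea of handling a varying number of speakers by aggregating individual pose embeddings, here is a minimal sketch. It is not the authors' MPE implementation: the abstract does not say how the aggregation is done, so the mean pooling used here, the function name `aggregate_pose_embeddings`, and the toy two-dimensional embeddings are all assumptions made for illustration only.

```python
from typing import List, Sequence


def aggregate_pose_embeddings(per_speaker: Sequence[Sequence[float]]) -> List[float]:
    """Mean-pool any number of same-sized per-speaker embeddings.

    Hypothetical helper: mean pooling is an assumed aggregation, chosen
    because its output size does not depend on the number of speakers.
    """
    if not per_speaker:
        raise ValueError("need at least one speaker embedding")
    dim = len(per_speaker[0])
    if any(len(e) != dim for e in per_speaker):
        raise ValueError("all embeddings must share one dimension")
    n = len(per_speaker)
    # Average each coordinate across speakers.
    return [sum(e[i] for e in per_speaker) / n for i in range(dim)]


# Toy embeddings (made-up values) for a two-speaker and a four-speaker clip.
two = aggregate_pose_embeddings([[1.0, 2.0], [3.0, 4.0]])
four = aggregate_pose_embeddings(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
)
print(two)   # [2.0, 3.0]
print(four)  # [4.0, 5.0]
```

Both calls return a vector of the same length, and the result does not depend on the order in which speakers are listed. Those two properties are what let a downstream model accept two, three, or four speakers without changing its input shape.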

AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement

CV and Pattern Recognition

Makes videos of many people talking from one person.

28 Nov 2025 1

90%

TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis

CV and Pattern Recognition

Makes talking videos look real for everyone.

19 Aug 2025 2

88%

MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation

CV and Pattern Recognition

Creates talking videos from sound, pose, and text.

26 Aug 2025 0


Repos / Data Links

github.com

Page Count

13 pages
