Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

Published: December 17, 2025 | arXiv ID: 2512.15340v1

By: Junjie Chen, Fei Wang, Zhihao Huang and more

Potential Business Impact:

Makes avatars talk and move like real people.

Business Areas:

Motion Capture, Media and Entertainment, Video

Human conversation involves continuous exchanges of speech and nonverbal cues, such as head nods, gaze shifts, and facial expressions, that convey attention and emotion. Modeling these bidirectional dynamics in 3D is essential for building expressive avatars and interactive robots. However, existing frameworks often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, which hinders temporal coherence across turns. We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics that capture both coordination and expressive variability. Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set and achieves similar gains on out-of-distribution data. The source code will be released at the GitHub repository https://github.com/CoderChen01/towards-seamleass-interaction.
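To make "turn-level causal attention" concrete, the following is a minimal sketch of one way such an attention mask could be built. It is not taken from the TIMAR code, which had not been released when this summary was written. The function name `turn_level_causal_mask` and the per-turn token counts are hypothetical, and the sketch assumes that tokens attend freely within their own turn and to all earlier turns, but never to later ones.

```python
def turn_level_causal_mask(turn_lengths):
    """Build a boolean attention mask for a sequence split into turns.

    mask[i][j] is True when token i may attend to token j. Under the
    assumption made here, that holds when token j belongs to the same
    turn as token i or to an earlier turn.
    """
    # Map each token position to the index of the turn it belongs to.
    turn_ids = [turn for turn, length in enumerate(turn_lengths)
                for _ in range(length)]
    # Block-causal rule: attention is allowed only toward turns <= own turn.
    return [[tj <= ti for tj in turn_ids] for ti in turn_ids]


# Hypothetical example: three turns containing 2, 3, and 1 tokens.
mask = turn_level_causal_mask([2, 3, 1])
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

The printed pattern is block lower-triangular: tokens of the first turn see only that turn, while the single token of the last turn sees the whole history. A mask of this shape can be passed to a standard attention layer so that conversational history accumulates turn by turn without access to future turns.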

Multimodal Transformer Models for Turn-taking Prediction: Effects on Conversational Dynamics of Human-Agent Interaction during Cooperative Gameplay

Human-Computer Interaction

Helps game characters know when to talk.

5 Feb 2025 1

88%

IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer

CV and Pattern Recognition

Makes faces talk realistically from pictures.

27 Nov 2025 1

88%

EAI-Avatar: Emotion-Aware Interactive Talking Head Generation

Audio and Speech Processing

Makes talking robots show real feelings.

25 Aug 2025 0

Page Count

16 pages
