Score: 1

HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

Published: August 14, 2025 | arXiv ID: 2508.10566v1

By: Shiyu Liu, Kui Jiang, Xianming Liu, and more

Potential Business Impact:

Makes computer faces talk smoothly and clearly.

Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily due to their reliance on implicit modeling of audio-facial motion correlations, an approach that lacks explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation combining both implicit and explicit motion cues. Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features while predicting AUs directly from audio input aligned to visual cues. To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit/explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker's superiority over state-of-the-art methods in visual quality and lip-sync accuracy.
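To make the idea of "dynamically merging randomly paired implicit/explicit features" more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify how pairing or merging is done, so the batch-level shuffle, the per-dimension sigmoid gate, and all names (`random_pairing`, `gated_merge`, `hybrid_merge_batch`, `gate_weights`) are assumptions chosen only for illustration.

```python
import math
import random


def random_pairing(batch_size, rng):
    """Return a random permutation of sample indices.

    Assumption: each implicit feature is paired with the explicit
    feature of a randomly chosen sample in the same batch.
    """
    perm = list(range(batch_size))
    rng.shuffle(perm)
    return perm


def gated_merge(implicit, explicit, gate_weights):
    """Blend two equal-length feature vectors with an input-dependent gate.

    Assumption: a per-dimension sigmoid gate stands in for whatever
    learned fusion the real module uses.
    """
    merged = []
    for imp, exp, w in zip(implicit, explicit, gate_weights):
        gate = 1.0 / (1.0 + math.exp(-w * (imp + exp)))  # value in (0, 1)
        merged.append(gate * imp + (1.0 - gate) * exp)   # convex combination
    return merged


def hybrid_merge_batch(implicit_batch, explicit_batch, gate_weights, seed=0):
    """Pair implicit and explicit features at random, then merge each pair."""
    rng = random.Random(seed)
    perm = random_pairing(len(implicit_batch), rng)
    merged = [
        gated_merge(implicit_batch[i], explicit_batch[perm[i]], gate_weights)
        for i in range(len(implicit_batch))
    ]
    return merged, perm


# Toy data: 3 samples, 2 feature dimensions each (values are made up).
implicit_batch = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.5]]
explicit_batch = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
merged, perm = hybrid_merge_batch(implicit_batch, explicit_batch, [1.0, 1.0])
```

Because each merged value is a convex combination of one implicit and one explicit value, it always lies between the two inputs. The shuffle means the explicit feature will often come from a different sample, which is one plausible reading of how random pairing could discourage identity-specific shortcuts.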

IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer

CV and Pattern Recognition

Makes faces talk realistically from pictures.

27 Nov 2025 1

90%

MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding

CV and Pattern Recognition

Makes talking faces show real feelings from sound.

8 Jul 2025 1

89%

FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis

CV and Pattern Recognition

Makes still pictures talk and move like real people.

7 Apr 2025 1


Country of Origin

🇨🇳 China

Page Count

9 pages
