DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation
By: Peng Chen, Xiaobao Wei, Ming Lu, et al.
Potential Business Impact:
Lets a voice recording drive a personalized 3D talking face in real time.
Real-time speech-driven 3D facial animation has attracted broad interest in academia and industry. Traditional methods mainly focus on learning a deterministic mapping from speech to animation. Recent approaches account for the nondeterministic nature of speech-driven 3D facial animation and employ diffusion models for the task. While existing diffusion-based methods improve the diversity of facial animation, they still lack personalized speaking styles with accurate lip movements, and their efficiency and compactness leave room for improvement. In this work, we propose DiffusionTalker to address these limitations via personalizer-guided distillation. For personalization, we introduce a contrastive personalizer that learns identity and emotion embeddings to capture speaking styles from audio, and we further propose a personalizer enhancer during distillation to strengthen the influence of these embeddings on the facial animation. For efficiency, we use iterative distillation to reduce the number of steps required for animation generation, achieving a more than 8x speedup in inference. For compactness, we distill the large teacher model into a smaller student model, reducing storage by 86.4% while minimizing performance loss. After distillation, users can derive their identity and emotion embeddings from audio to quickly create personalized animations that reflect specific speaking styles. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code will be released at: https://github.com/ChenVoid/DiffusionTalker.
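To make the iterative-distillation idea concrete, below is a minimal, hypothetical PyTorch sketch of one step-halving distillation round for a conditional denoiser: a student learns to match two teacher denoising steps with a single step, so repeating the round shrinks the sampling budget (e.g., toward the paper's 8x speedup). All names (TalkerNet, distill_round) and the toy architecture are illustrative assumptions, not the authors' released code; noise-schedule details and the personalizer enhancer loss are omitted.

```python
# Hypothetical sketch of iterative step distillation for a speech-driven
# talking-head denoiser. Not the authors' implementation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TalkerNet(nn.Module):
    """Toy denoiser: predicts clean facial-motion parameters from a noisy
    sample, a diffusion timestep, audio features, and a style embedding
    (stand-in for the identity/emotion embeddings from the personalizer)."""
    def __init__(self, motion_dim=64, audio_dim=128, style_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + audio_dim + style_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, motion_dim),
        )

    def forward(self, x_t, t, audio, style):
        t_feat = t.float().unsqueeze(-1) / 1000.0  # crude timestep encoding
        return self.net(torch.cat([x_t, audio, style, t_feat], dim=-1))

def distill_round(teacher, n_teacher_steps, data_loader, epochs=1, lr=1e-4):
    """One distillation round: the student matches two consecutive teacher
    denoising steps with a single step, halving the sampling budget."""
    student = copy.deepcopy(teacher)  # a smaller student would aid compactness
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x_t, t, audio, style in data_loader:
            with torch.no_grad():
                # Two consecutive teacher predictions (schedule math elided).
                x_mid = teacher(x_t, t, audio, style)
                target = teacher(x_mid, t // 2, audio, style)
            pred = student(x_t, t, audio, style)  # one student step
            loss = F.mse_loss(pred, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student, n_teacher_steps // 2
```

Calling distill_round repeatedly (teacher becomes the previous round's student) halves the step count each round; for model compression, the student would instead be instantiated with a narrower architecture than the teacher.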
Similar Papers
StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model
CV and Pattern Recognition
Makes computer faces talk in real-time.
DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentangled Diffusion Model
CV and Pattern Recognition
Makes talking faces look real and move smoothly.