StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model

Published: November 18, 2025 | arXiv ID: 2511.14223v1

By: Yifan Yang, Zhi Cen, Sida Peng, and more

Potential Business Impact:

Generates speech-synchronized 3D talking-face animation in real time, with latency independent of audio length.

Business Areas:
Speech Recognition Data and Analytics, Software

This paper focuses on the task of speech-driven 3D facial animation, which aims to generate realistic facial motions synchronized with speech input. Recent methods have employed audio-conditioned diffusion models for 3D facial animation, achieving impressive results in generating expressive and natural animations. However, these methods process the whole audio sequence in a single pass, which poses two major challenges: they tend to perform poorly on audio sequences that exceed the training horizon, and they suffer from significant latency when processing long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that processes input audio in a streaming manner. This design ensures flexibility with varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past frames as historical motion context and combine them with the audio input to form a dynamic condition. This condition guides the diffusion process to iteratively generate facial motion frames, enabling real-time synthesis with high-quality results. Additionally, we implement a real-time interactive demo, highlighting the effectiveness and efficiency of our approach. We will release the code at https://zju3dv.github.io/StreamingTalker/.
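
The streaming idea lends itself to a short illustration. Below is a minimal sketch, not the authors' released code, of how an autoregressive diffusion loop of this kind could work: each incoming audio chunk is combined with a small window of previously generated motion frames to form the dynamic condition, a chunk of facial motion is denoised from noise under that condition, and the history window then slides forward. The module shapes, window sizes, and the toy denoising update are illustrative assumptions.

```python
# Illustrative sketch (not the paper's implementation) of streaming
# autoregressive diffusion for facial motion: generate frames chunk by
# chunk, conditioning on the current audio window plus a short buffer of
# previously generated motion.
import torch
import torch.nn as nn

AUDIO_DIM, MOTION_DIM = 128, 64     # assumed feature sizes
CHUNK, HISTORY = 10, 25             # frames per generation step / past-frame context
DIFFUSION_STEPS = 50                # assumed number of denoising iterations


class Denoiser(nn.Module):
    """Toy denoiser: predicts a clean motion chunk from a noisy chunk plus
    the dynamic condition (audio chunk + historical motion frames)."""

    def __init__(self):
        super().__init__()
        cond_dim = CHUNK * AUDIO_DIM + HISTORY * MOTION_DIM
        self.net = nn.Sequential(
            nn.Linear(CHUNK * MOTION_DIM + cond_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, CHUNK * MOTION_DIM),
        )

    def forward(self, noisy_chunk, audio_chunk, history, t):
        x = torch.cat(
            [noisy_chunk.flatten(1),
             audio_chunk.flatten(1),
             history.flatten(1),
             t.float().unsqueeze(1)],
            dim=1,
        )
        return self.net(x).view(-1, CHUNK, MOTION_DIM)


@torch.no_grad()
def stream_generate(denoiser, audio_stream):
    """audio_stream yields (1, CHUNK, AUDIO_DIM) audio-feature chunks."""
    history = torch.zeros(1, HISTORY, MOTION_DIM)  # neutral-pose motion context
    for audio_chunk in audio_stream:
        # Denoise only the current chunk, so per-chunk latency stays
        # constant regardless of total audio length.
        motion = torch.randn(1, CHUNK, MOTION_DIM)
        for t in reversed(range(DIFFUSION_STEPS)):
            t_batch = torch.full((1,), t)
            pred = denoiser(motion, audio_chunk, history, t_batch)
            # Toy update standing in for a real DDPM/DDIM sampling step.
            motion = pred + 0.1 * torch.randn_like(pred) * (t / DIFFUSION_STEPS)
        yield motion
        # Slide the history window to include the newly generated frames.
        history = torch.cat([history, motion], dim=1)[:, -HISTORY:]


if __name__ == "__main__":
    denoiser = Denoiser()
    fake_audio = (torch.randn(1, CHUNK, AUDIO_DIM) for _ in range(3))
    for i, chunk in enumerate(stream_generate(denoiser, fake_audio)):
        print(f"chunk {i}: motion shape {tuple(chunk.shape)}")
```

In a full system the toy update would be replaced by a proper diffusion sampler and the denoiser by a network conditioned on learned audio features; the point of the sketch is the sliding history window, which is what keeps latency independent of audio duration while still propagating motion context across chunks.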

Country of Origin
🇨🇳 China

Page Count
13 pages

Category
Computer Science:
CV and Pattern Recognition