TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models
By: Chetwin Low, Weimin Wang
Potential Business Impact:
Animates characters to talk and move in real time from audio, enabling FaceTime-style video experiences.
In this paper, we present TalkingMachines -- an efficient framework that transforms pretrained video generation models into real-time, audio-driven character animators. TalkingMachines enables natural conversational experiences by integrating an audio large language model (LLM) with our video generation foundation model. Our primary contributions include: (1) We adapt a pretrained state-of-the-art image-to-video DiT into an 18-billion-parameter audio-driven avatar generation model; (2) We enable infinite video streaming without error accumulation through asymmetric knowledge distillation from a bidirectional teacher model into a sparse causal, autoregressive student model; (3) We design a high-throughput, low-latency inference pipeline incorporating several key engineering optimizations, including: (a) disaggregation of the DiT and VAE decoder across separate devices, (b) efficient overlap of inter-device communication and computation using CUDA streams, and (c) elimination of redundant recomputations to maximize frame-generation throughput. Demo videos: https://aaxwaz.github.io/TalkingMachines/
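To make optimizations (a) and (b) from the abstract more concrete, here is a minimal sketch in PyTorch of how a disaggregated DiT/VAE pipeline with stream-overlapped latent transfers might look. This is not the authors' code: the module definitions, tensor shapes, chunking loop, and function names are illustrative placeholders (the real system uses an 18B-parameter DiT and a learned VAE decoder), and the sketch assumes two CUDA devices are available.

```python
import torch
import torch.nn as nn

# Sketch of the disaggregated inference pipeline: the DiT runs on GPU 0,
# the VAE decoder on GPU 1, and a side CUDA stream overlaps the latent
# transfer with the next DiT step. Tiny placeholder modules stand in for
# the paper's actual 18B-parameter DiT and VAE decoder.

assert torch.cuda.device_count() >= 2, "sketch assumes two CUDA devices"
DIT_DEV, VAE_DEV = torch.device("cuda:0"), torch.device("cuda:1")

dit = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)).to(DIT_DEV)
vae_decoder = nn.Linear(64, 3 * 16 * 16).to(VAE_DEV)   # stand-in for the VAE decoder

copy_stream = torch.cuda.Stream(device=DIT_DEV)        # carries latent copies to GPU 1

@torch.no_grad()
def stream_chunks(audio_feats):
    """Generate latent chunks on GPU 0 while GPU 1 decodes the previous chunk."""
    frames, pending = [], None                          # pending = (latent_on_vae, ready_event)
    latent = torch.randn(1, 64, device=DIT_DEV)
    for feat in audio_feats:
        # DiT step for the current chunk, conditioned on the audio feature.
        latent = dit(latent + feat.to(DIT_DEV, non_blocking=True))

        # Launch the device-to-device copy on the side stream so it overlaps
        # with the next DiT step instead of blocking it.
        copy_stream.wait_stream(torch.cuda.current_stream(DIT_DEV))
        with torch.cuda.stream(copy_stream):
            latent_on_vae = latent.to(VAE_DEV, non_blocking=True)
            ready = torch.cuda.Event()
            ready.record(copy_stream)
        latent.record_stream(copy_stream)               # keep source alive for the async copy

        # Decode the chunk whose copy was issued in the previous iteration.
        if pending is not None:
            prev_latent, prev_ready = pending
            torch.cuda.current_stream(VAE_DEV).wait_event(prev_ready)
            frames.append(vae_decoder(prev_latent))
        pending = (latent_on_vae, ready)

    # Drain the last in-flight chunk.
    last_latent, last_ready = pending
    torch.cuda.current_stream(VAE_DEV).wait_event(last_ready)
    frames.append(vae_decoder(last_latent))
    torch.cuda.synchronize(DIT_DEV)
    torch.cuda.synchronize(VAE_DEV)
    return frames

frames = stream_chunks(torch.randn(8, 1, 64))
print(len(frames), frames[0].shape)
```

The scheduling idea this sketch illustrates is that the latent copy for chunk t and the VAE decode of chunk t can hide behind the DiT denoising of chunk t+1, so steady-state frame throughput is bounded by the DiT alone rather than by the sum of DiT, transfer, and decode time.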
Similar Papers
StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model
CV and Pattern Recognition
Makes computer-generated faces talk in real time.
LLIA -- Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models
CV and Pattern Recognition
Makes talking avatars move realistically and fast.