SoulX-LiveTalk Technical Report

Published: December 29, 2025 | arXiv ID: 2512.23379v1

By: Le Shen, Qiao Qian, Tan Yu and more

Potential Business Impact:

Makes digital people talk and move instantly.

Business Areas:

Speech Recognition, Data and Analytics, Software

Deploying massive diffusion models for real-time, infinite-duration, audio-driven avatar generation presents a significant engineering challenge, primarily due to the conflict between computational load and strict latency constraints. Existing approaches often compromise visual fidelity by enforcing strictly unidirectional attention mechanisms or reducing model capacity. To address this problem, we introduce SoulX-LiveTalk, a 14B-parameter framework optimized for high-fidelity real-time streaming. Diverging from conventional unidirectional paradigms, we use a Self-correcting Bidirectional Distillation strategy that retains bidirectional attention within video chunks. This design preserves critical spatiotemporal correlations, significantly enhancing motion coherence and visual detail. To ensure stability during infinite generation, we incorporate a Multi-step Retrospective Self-Correction Mechanism, enabling the model to autonomously recover from accumulated errors and preventing collapse. Furthermore, we engineered a full-stack inference acceleration suite incorporating hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations. Extensive evaluations confirm that SoulX-LiveTalk is the first 14B-scale system to achieve a sub-second start-up latency (0.87s) while reaching a real-time throughput of 32 FPS, setting a new standard for high-fidelity interactive digital human synthesis.
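The abstract contrasts strictly unidirectional attention with bidirectional attention retained within video chunks. The difference can be illustrated with a small attention-mask sketch. This is not the authors' implementation: the function name `chunk_attention_mask`, the frame-level granularity, and the rule that a frame may also see all earlier chunks but no later ones are assumptions made purely for illustration, since the abstract does not specify how chunks attend to each other.

```python
def chunk_attention_mask(num_frames, chunk_size):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Assumption (not from the source): frames attend freely in both
    directions inside their own chunk, see every earlier chunk, and
    never see a later chunk.
    """
    if num_frames < 0 or chunk_size <= 0:
        raise ValueError("num_frames must be >= 0 and chunk_size > 0")
    mask = []
    for i in range(num_frames):
        query_chunk = i // chunk_size
        # A key frame is visible if its chunk is the same or earlier.
        row = [(j // chunk_size) <= query_chunk for j in range(num_frames)]
        mask.append(row)
    return mask


def render(mask):
    """Format the mask as text rows: '1' = visible, '.' = masked."""
    return ["".join("1" if ok else "." for ok in row) for row in mask]


if __name__ == "__main__":
    # Four frames split into two chunks of two frames each.
    for line in render(chunk_attention_mask(4, 2)):
        print(line)
    # prints:
    # 11..
    # 11..
    # 1111
    # 1111
```

In a strictly unidirectional (causal) mask, frame 0 could not see frame 1. Here it can, because both frames sit in the same chunk, while frames in the second chunk remain hidden from the first.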

Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation

CV and Pattern Recognition

Makes still pictures talk and move like real people.

15 Dec 2025 1

91%

Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

CV and Pattern Recognition

Makes talking avatars move instantly.

4 Dec 2025 2

91%

Country of Origin

🇨🇳 China

Page Count

12 pages
