Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation
By: Jiangning Zhang, Junwei Zhu, Zhenye Gan, and more
Potential Business Impact:
Makes still pictures talk and move like real people.
We propose $\textbf{Soul}$, a multimodal-driven framework for high-fidelity, long-term digital human animation that generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. To mitigate data scarcity, we construct Soul-1M, a dataset of 1 million finely annotated samples covering portrait, upper-body, full-body, and multi-person scenes, built with a precise automated annotation pipeline. We also carefully curate Soul-Bench for comprehensive and fair evaluation of audio- and text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies, together with threshold-aware codebook replacement to ensure long-term generation consistency. Step/CFG distillation and a lightweight VAE improve inference efficiency, achieving an 11.4$\times$ speedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms current leading open-source and commercial models on video quality, video-text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production. Project page: https://zhangzjn.github.io/projects/Soul/
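As background for the step/CFG distillation mentioned in the abstract, the following is a minimal sketch of generic classifier-free guidance (CFG) and of why removing it, along with reducing the number of sampling steps, lowers the count of denoiser calls. It is not Soul's implementation: the function names, the toy prediction values, and the step counts (40 and 4) are hypothetical and chosen only for illustration, and the abstract's 11.4$\times$ figure also reflects the lightweight VAE, which is not modeled here.

```python
def cfg_combine(cond, uncond, scale):
    """Generic classifier-free guidance: move from the unconditional
    prediction toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def denoiser_calls(num_steps, uses_cfg):
    """Count denoiser forward passes for one sample.
    With CFG, each step needs a conditional and an unconditional pass."""
    return num_steps * (2 if uses_cfg else 1)


# Toy predictions (hypothetical values, for illustration only).
cond = [1.0, 2.0]
uncond = [0.5, 1.0]
guided = cfg_combine(cond, uncond, scale=3.0)

# Hypothetical step counts; the abstract does not state the actual settings.
baseline_calls = denoiser_calls(num_steps=40, uses_cfg=True)
distilled_calls = denoiser_calls(num_steps=4, uses_cfg=False)

print(guided, baseline_calls, distilled_calls)
```

In this sketch, a CFG-distilled model is assumed to produce the guided prediction in a single pass, so the per-step cost halves, while step distillation reduces the number of steps.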
Similar Papers
Panel-by-Panel Souls: A Performative Workflow for Expressive Faces in AI-Assisted Manga Creation
Human-Computer Interaction
Lets artists draw manga characters with real emotions.
OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation
CV and Pattern Recognition
Makes video characters act with real feelings.
InfinityHuman: Towards Long-Term Audio-Driven Human
CV and Pattern Recognition
Makes talking people in videos look real.