KlingAvatar 2.0 Technical Report

Published: December 15, 2025 | arXiv ID: 2512.13313v1

By: Kling Team, Jialu Chen, Yikang Ding, and more

Potential Business Impact:

Makes long, clear videos from your words.

Business Areas:

Motion Capture, Media and Entertainment, Video

Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.
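To make the cascade in the abstract concrete, the following is a minimal sketch of how a sequence of blueprint keyframes could be split into sub-clips that are each conditioned on a first and a last frame. The abstract does not give the actual scheduling, so everything here is an assumption for illustration: the names `SubClip`, `plan_sub_clips`, and `total_output_frames` are hypothetical, as is the idea of a fixed integer `temporal_upscale` factor between consecutive keyframes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SubClip:
    """One refinement job bounded by two blueprint keyframes (hypothetical type)."""
    first_keyframe: int  # index of the blueprint keyframe that opens the sub-clip
    last_keyframe: int   # index of the blueprint keyframe that closes the sub-clip
    start_frame: int     # position of the opening keyframe in the final video
    end_frame: int       # position of the closing keyframe in the final video


def plan_sub_clips(num_keyframes: int, temporal_upscale: int) -> List[SubClip]:
    """Pair consecutive keyframes into first-last frame refinement jobs.

    Assumption: each pair of neighbouring keyframes is expanded into
    `temporal_upscale` frame intervals in the final video.
    """
    if num_keyframes < 2:
        raise ValueError("need at least two keyframes to form a sub-clip")
    if temporal_upscale < 1:
        raise ValueError("temporal_upscale must be a positive integer")
    return [
        SubClip(
            first_keyframe=i,
            last_keyframe=i + 1,
            start_frame=i * temporal_upscale,
            end_frame=(i + 1) * temporal_upscale,
        )
        for i in range(num_keyframes - 1)
    ]


def total_output_frames(num_keyframes: int, temporal_upscale: int) -> int:
    """Frame count of the final video when boundary frames are shared, not duplicated."""
    return (num_keyframes - 1) * temporal_upscale + 1


if __name__ == "__main__":
    for clip in plan_sub_clips(num_keyframes=4, temporal_upscale=8):
        print(clip)
```

In this sketch, neighbouring sub-clips share their boundary keyframe (the `end_frame` of one equals the `start_frame` of the next), which is one plausible way to keep transitions smooth across a long video. Because each sub-clip depends only on its own two keyframes, the refinement jobs could in principle run in parallel.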

Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

CV and Pattern Recognition

Makes talking cartoon characters act and feel real.

11 Sep 2025 0

93%

AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video

CV and Pattern Recognition

Helps robots understand videos by watching and listening.

5 Aug 2025 0

Page Count

14 pages
