LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation
By: Wenhui Song, Hanhui Li, Jiehui Huang and more
Potential Business Impact:
Generates videos that keep the same person's face consistent from frame to frame.
In this paper, we present LaVieID, a novel local autoregressive video diffusion framework designed to tackle the challenging identity-preserving text-to-video task. The key idea of LaVieID is to mitigate the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs) from both spatial and temporal perspectives. Specifically, unlike the global and unstructured modeling of facial latent states in existing DiTs, LaVieID introduces a local router to explicitly represent latent states by weighted combinations of fine-grained local facial structures. This alleviates undesirable feature interference and encourages DiTs to capture distinctive facial characteristics. Furthermore, a temporal autoregressive module is integrated into LaVieID to refine denoised latent tokens before video decoding. This module divides latent tokens temporally into chunks, exploiting their long-range temporal dependencies to predict biases for rectifying tokens, thereby significantly enhancing inter-frame identity consistency. Consequently, LaVieID can generate high-fidelity personalized videos and achieve state-of-the-art performance. Our code and models are available at https://github.com/ssugarwh/LaVieID.
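To make the two mechanisms in the abstract concrete, here is a minimal PyTorch sketch of how a local router and a chunk-wise temporal autoregressive rectifier could look. The class names, the number of facial regions, the chunk size, and the tensor shapes are illustrative assumptions, not taken from the released LaVieID code.

```python
# Illustrative sketch only: LocalRouter, TemporalARRectifier, num_regions,
# and chunk_size are hypothetical names/values, not the authors' implementation.
import torch
import torch.nn as nn

class LocalRouter(nn.Module):
    """Represent each facial latent token as a weighted combination of
    learnable local-structure prototypes rather than one global embedding."""
    def __init__(self, dim: int, num_regions: int = 8):
        super().__init__()
        # One learnable prototype per fine-grained facial region (assumption).
        self.prototypes = nn.Parameter(torch.randn(num_regions, dim))
        self.router = nn.Linear(dim, num_regions)  # predicts mixing weights

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        weights = self.router(tokens).softmax(dim=-1)   # (B, N, num_regions)
        local = weights @ self.prototypes               # (B, N, dim)
        return tokens + local  # inject structured local identity features


class TemporalARRectifier(nn.Module):
    """Split per-frame latents into temporal chunks and autoregressively
    predict a corrective bias for each chunk from the chunks before it."""
    def __init__(self, dim: int, chunk_size: int = 4):
        super().__init__()
        self.chunk_size = chunk_size
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_bias = nn.Linear(dim, dim)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, frames, dim) pooled per-frame latent tokens
        chunks = latents.split(self.chunk_size, dim=1)
        rectified, history = [], []
        for chunk in chunks:
            if history:
                context = torch.cat(history, dim=1)
                # Summarize long-range history and predict a rectifying bias.
                summary = self.encoder(context).mean(dim=1, keepdim=True)
                chunk = chunk + self.to_bias(summary)
            rectified.append(chunk)
            history.append(chunk)
        return torch.cat(rectified, dim=1)
```

In this reading, the router adds region-structured identity features to each token before denoising, while the rectifier runs after denoising and before video decoding, nudging later chunks toward the identity established by earlier ones.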
Similar Papers
Beyond Inference Intervention: Identity-Decoupled Diffusion for Face Anonymization
CV and Pattern Recognition
Makes faces look different but still real.
Identity Preserving Latent Diffusion for Brain Aging Modeling
Graphics
Changes brain scans to show aging while keeping the same person's identity.
FLUID: Training-Free Face De-identification via Latent Identity Substitution
CV and Pattern Recognition
Changes faces in pictures without losing details.