VALA: Learning Latent Anchors for Training-Free and Temporally Consistent Video Editing
By: Zhangkai Wu, Xuhui Fan, Zhongyuan Xie and more
Potential Business Impact:
Makes video editing faster and more consistent.
Recent advances in training-free video editing have enabled lightweight and precise cross-frame generation by leveraging pre-trained text-to-image diffusion models. However, existing methods often rely on heuristic frame selection to maintain temporal consistency during DDIM inversion, which introduces manual bias and reduces the scalability of end-to-end inference. In this paper, we propose VALA (Variational Alignment for Latent Anchors), a variational alignment module that adaptively selects key frames and compresses their latent features into semantic anchors for consistent video editing. To learn meaningful assignments, VALA adopts a variational framework with a contrastive learning objective, allowing it to transform cross-frame latent representations into compressed latent anchors that preserve both content and temporal coherence. Our method can be fully integrated into training-free, text-to-image-based video editing models. Extensive experiments on real-world video editing benchmarks show that VALA achieves state-of-the-art performance in inversion fidelity, editing quality, and temporal consistency, while offering improved efficiency over prior methods.
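To make the anchor idea more concrete, the sketch below illustrates one way cross-frame latents (e.g. obtained via DDIM inversion) could be softly assigned to a small set of latent anchors and trained with a contrastive objective. This is a minimal PyTorch sketch under assumptions of our own: the function names, the EM-style refinement, and the InfoNCE-style loss are illustrative stand-ins, not the paper's actual variational formulation.

# Minimal sketch of the latent-anchor idea; shapes, names, and the exact
# objective are assumptions, not the paper's formulation.
import torch
import torch.nn.functional as F

def compress_to_anchors(latents, num_anchors=4, temperature=0.1, iters=10):
    """Softly assign per-frame latents to a small set of latent anchors.

    latents: (T, D) tensor of flattened DDIM-inverted frame latents.
    Returns (anchors, assignments): (K, D) anchors and (T, K) soft weights.
    """
    T, _ = latents.shape
    # Initialize anchors from evenly spaced frames; a simple stand-in for
    # the adaptive key-frame selection described in the abstract.
    idx = torch.linspace(0, T - 1, num_anchors).long()
    anchors = latents[idx].clone()
    for _ in range(iters):  # a few EM-like refinement steps
        sim = F.normalize(latents, dim=-1) @ F.normalize(anchors, dim=-1).T
        assign = F.softmax(sim / temperature, dim=-1)        # (T, K) soft weights
        anchors = (assign.T @ latents) / assign.sum(0, keepdim=True).T
    return anchors, assign

def contrastive_anchor_loss(latents, anchors, assign, temperature=0.1):
    """InfoNCE-style loss pulling each frame latent toward its assigned anchor."""
    sim = F.normalize(latents, dim=-1) @ F.normalize(anchors, dim=-1).T
    targets = assign.argmax(dim=-1)                          # hard pseudo-labels
    return F.cross_entropy(sim / temperature, targets)

# Usage: 16 frames of flattened 4x64x64 latents compressed into 4 anchors.
frame_latents = torch.randn(16, 4 * 64 * 64)
anchors, assign = compress_to_anchors(frame_latents, num_anchors=4)
loss = contrastive_anchor_loss(frame_latents, anchors, assign)

The soft-assignment step is what would let such a module avoid hand-picked key frames: every frame contributes to every anchor in proportion to its similarity, so the compressed anchors can preserve content shared across frames.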
Similar Papers
Language as an Anchor: Preserving Relative Visual Geometry for Domain Incremental Learning
CV and Pattern Recognition
Keeps AI smart when learning new things.
AVAM: Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering
CV and Pattern Recognition
Helps computers understand many pictures better.
AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows
CV and Pattern Recognition
Changes 3D shapes with words, keeping them stable.