BachVid: Training-Free Video Generation with Consistent Background and Character
By: Han Yan, Xibin Song, Yifu Wang, and more
Potential Business Impact:
Makes videos with the same people and places.
Diffusion Transformers (DiTs) have recently driven significant progress in text-to-video (T2V) generation. However, generating multiple videos with consistent characters and backgrounds remains a major challenge. Existing methods typically rely on reference images or extensive training, and often address only character consistency, leaving background consistency to image-to-video models. We introduce BachVid, the first training-free method that achieves consistent video generation without any reference images. Our approach is based on a systematic analysis of DiT's attention mechanism and intermediate features, which reveals its ability to extract foreground masks and identify matching points during the denoising process. Leveraging this finding, our method first generates an identity video and caches the intermediate variables, then injects these cached variables into the corresponding positions of newly generated videos, ensuring both foreground and background consistency across multiple videos. Experimental results demonstrate that BachVid achieves robust consistency in generated videos, offering a novel and efficient solution for consistent video generation without relying on reference images or additional training.
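To make the cache-then-inject idea concrete, here is a minimal PyTorch sketch of the general pattern the abstract describes: cache a DiT block's intermediate output during an "identity" generation pass, then blend it back into later passes at foreground token positions. All names here (`AttentionBlock`, `FeatureCacheInjector`, the mask handling) are hypothetical illustrations, not BachVid's actual implementation, which operates on the attention mechanism and features inside the full denoising loop.

```python
# Hypothetical sketch of the cache-then-inject idea; not the paper's code.
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Stand-in for one DiT attention block (toy, single-head)."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.proj(attn @ v)

class FeatureCacheInjector:
    """Cache a block's output on a first ("identity") pass, then overwrite
    foreground token positions with it on later passes, given a mask."""
    def __init__(self, block: nn.Module):
        self.cached = None
        self.mask = None          # (1, tokens, 1) foreground mask, assumed given
        self.mode = "cache"       # "cache" on the identity pass, "inject" after
        block.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        if self.mode == "cache":
            self.cached = output.detach()
            return output
        # Inject: replace foreground tokens with the cached features.
        return torch.where(self.mask.bool(), self.cached, output)

dim, tokens = 64, 16
block = AttentionBlock(dim)
hook = FeatureCacheInjector(block)

# Pass 1: generate the identity video; intermediate features get cached.
_ = block(torch.randn(1, tokens, dim))

# Pass 2: a new video; cached features are injected at foreground positions.
hook.mode = "inject"
hook.mask = (torch.rand(1, tokens, 1) > 0.5).float()
out = block(torch.randn(1, tokens, dim))
print(out.shape)  # torch.Size([1, 16, 64])
```

In this toy version the foreground mask is random; in the paper it is extracted from the DiT's own attention during denoising, and injection happens at matched positions across videos rather than at identical token indices.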
Similar Papers
Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback
CV and Pattern Recognition
Makes videos of people talking from sound.
GenCompositor: Generative Video Compositing with Diffusion Transformer
CV and Pattern Recognition
Lets you easily add video clips into movies.
Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks
CV and Pattern Recognition
Makes computers create longer, smoother videos faster.