
AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation

Published: December 11, 2025 | arXiv ID: 2512.10943v1

By: Sharath Girish, Viacheslav Ivanov, Tsai-Shien Chen and more

Potential Business Impact:

Controls when people appear and disappear in videos.

Business Areas:

Motion Capture, Media and Entertainment, Video

Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over when subjects appear and disappear, which is essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach uses a novel positional encoding mechanism that encodes temporal intervals, associated in our case with subject identities, while integrating seamlessly with the positional embeddings of the pretrained video generation model. Additionally, we incorporate subject-descriptive text tokens to strengthen the binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark that evaluates multi-subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods while, for the first time, enabling precise temporal control over multi-subject generation within videos. The project page is at https://snap-research.github.io/Video-AlcheMinT
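
The abstract names two ingredients, an interval-aware positional encoding and token-wise concatenation, without giving their exact form. The sketch below is only an illustration of the general idea under loudly stated assumptions, not AlcheMinT's actual method: the `Token` class, the `interval_encoding` rule (the mean of two sinusoidal endpoint codes), and all parameter names are hypothetical. It shows reference tokens carrying an interval-derived code being appended to the video token sequence, so that a single self-attention pass could see both and no extra cross-attention module is needed.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Token:
    kind: str              # "video" or "reference"
    subject: int           # subject id, or -1 for video tokens
    pos_code: list[float]  # temporal positional code attached to the token


def sinusoidal(pos: float, dim: int = 8) -> list[float]:
    """Standard sinusoidal encoding of a scalar position (dim must be even)."""
    code = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        code += [math.sin(pos * freq), math.cos(pos * freq)]
    return code


def interval_encoding(start: float, end: float, dim: int = 8) -> list[float]:
    """Hypothetical interval code: mean of the two endpoint encodings.

    An assumption made for illustration; when start == end it reduces to
    the ordinary per-frame code, so it stays in the same encoding space.
    """
    return [(a + b) / 2
            for a, b in zip(sinusoidal(start, dim), sinusoidal(end, dim))]


def build_sequence(num_frames, tokens_per_frame, subjects, dim=8):
    """Concatenate video tokens and reference tokens into one sequence.

    subjects maps subject id -> (start_frame, end_frame, n_reference_tokens).
    """
    seq = [Token("video", -1, sinusoidal(float(f), dim))
           for f in range(num_frames) for _ in range(tokens_per_frame)]
    for sid, (start, end, n_ref) in subjects.items():
        code = interval_encoding(start, end, dim)
        seq += [Token("reference", sid, code) for _ in range(n_ref)]
    return seq


# Subject 0 should be on screen during frames 1..2, with 3 reference tokens.
seq = build_sequence(num_frames=4, tokens_per_frame=2, subjects={0: (1, 2, 3)})
```

Because the reference tokens live in the same sequence and the same positional space as the video tokens, this toy design adds no new attention layers, which mirrors the "negligible parameter overhead" claim only in spirit.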

TempoControl: Temporal Attention Guidance for Text-to-Video Models

CV and Pattern Recognition

Controls when things happen in AI-made videos.

2 Oct 2025 1

87%

Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising

CV and Pattern Recognition

Makes videos move exactly how you want.

9 Nov 2025 0

87%

AI Powered High Quality Text to Video Generation with Enhanced Temporal Consistency

CV and Pattern Recognition

Makes videos from words that look real.

30 Oct 2025 2


Page Count

22 pages
