
The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection

Published: December 23, 2025 | arXiv ID: 2512.20340v1

By: Qingdong He, Xueqin Chen, Yanjie Pan and more

Although diffusion transformer (DiT)-based video virtual try-on (VVT) has made significant progress in synthesizing realistic videos, existing methods still struggle to capture fine-grained garment dynamics and preserve background integrity across video frames. They also incur high computational costs because of the additional interaction modules introduced into DiTs, and the limited scale and quality of existing public datasets restrict model generalization and effective training. To address these challenges, we propose a novel framework, KeyTailor, along with a large-scale, high-definition dataset, ViT-HD. The core idea of KeyTailor is a keyframe-driven details injection strategy, motivated by the fact that keyframes inherently contain both foreground dynamics and background consistency. Specifically, KeyTailor adopts an instruction-guided keyframe sampling strategy to filter informative frames from the input video. Subsequently, two tailored keyframe-driven modules, the garment details enhancement module and the collaborative background optimization module, are employed to distill garment dynamics into garment-related latents and to optimize the integrity of background latents, both guided by keyframes. These enriched details are then injected into standard DiT blocks together with pose, mask, and noise latents, enabling efficient and realistic try-on video synthesis. This design ensures consistency without explicitly modifying the DiT architecture, and so avoids additional complexity. In addition, our dataset ViT-HD comprises 15,070 high-quality video samples at a resolution of 810×1080, covering diverse garments. Extensive experiments demonstrate that KeyTailor outperforms state-of-the-art baselines in terms of garment fidelity and background integrity across both dynamic and static scenarios.
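To make the idea of filtering informative frames more concrete, here is a minimal sketch of keyframe selection by frame differencing. This is not KeyTailor's instruction-guided sampling strategy, which the abstract does not specify; it swaps in a simple, generic heuristic for illustration only. The frame representation (flattened grayscale values), the function names `mean_abs_diff` and `select_keyframes`, and the `threshold` parameter are all assumptions introduced here.

```python
from typing import List, Sequence

# Hypothetical representation: one frame is a flattened list of grayscale values.
Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equally sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames: List[Frame], threshold: float) -> List[int]:
    """Return indices of frames that differ enough from the last kept keyframe.

    Generic frame-differencing heuristic (an assumption, not the paper's
    method): the first frame is always kept, and a later frame is kept when
    its mean absolute difference to the most recent keyframe exceeds
    `threshold`.
    """
    if not frames:
        return []
    keyframes = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[keyframes[-1]]) > threshold:
            keyframes.append(i)
    return keyframes


# Toy 4-pixel "video": frames 2 and 4 change substantially, frames 1 and 3 barely.
video = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.1, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
print(select_keyframes(video, threshold=0.2))  # [0, 2, 4]
```

Comparing against the last kept keyframe rather than the immediately preceding frame means slow, cumulative changes still eventually trigger a new keyframe. A real system would presumably score frames with learned features rather than raw pixel differences.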

DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework

CV and Pattern Recognition

Lets you try on clothes in videos realistically.

4 Aug 2025 2

89%

Eevee: Towards Close-up High-resolution Video-based Virtual Try-on

CV and Pattern Recognition

Makes online clothes look real in videos.

24 Nov 2025 1

89%

ChronoTailor: Harnessing Attention Guidance for Fine-Grained Video Virtual Try-On

CV and Pattern Recognition

Makes clothes look real when you try them on virtually.

6 Jun 2025 0
