Are Image-to-Video Models Good Zero-Shot Image Editors?
By: Zechuan Zhang, Zhenyuan Chen, Zongxin Yang, and more
Potential Business Impact:
Changes pictures using written instructions.
Large-scale video diffusion models show strong world simulation and temporal reasoning abilities, but their use as zero-shot image editors remains underexplored. We introduce IF-Edit, a tuning-free framework that repurposes pretrained image-to-video diffusion models for instruction-driven image editing. IF-Edit addresses three key challenges: prompt misalignment, redundant temporal latents, and blurry late-stage frames. It includes (1) a chain-of-thought prompt enhancement module that transforms static editing instructions into temporally grounded reasoning prompts; (2) a temporal latent dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving semantic and temporal coherence; and (3) a self-consistent post-refinement step that sharpens late-stage frames using a short still-video trajectory. Experiments on four public benchmarks, covering non-rigid editing, physical and temporal reasoning, and general instruction edits, show that IF-Edit performs strongly on reasoning-centric tasks while remaining competitive on general-purpose edits. Our study provides a systematic view of video diffusion models as image editors and highlights a simple recipe for unified video-image generative reasoning.
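To make the temporal latent dropout idea concrete, here is a minimal sketch of how such a step might slot into a standard image-to-video denoising loop. Everything here is an assumption for illustration: the function name `denoise_with_temporal_dropout`, the `expert_switch_step` parameter, the keep-every-k compression rule, and the diffusers-style model/scheduler interface are not taken from the paper, whose actual schedule and dropout rule are not specified in the abstract.

```python
import torch

def denoise_with_temporal_dropout(latents, scheduler, model, prompt_emb,
                                  expert_switch_step, keep_every=2):
    """Illustrative denoising loop (hypothetical, not the authors' code).

    After the expert-switch point, frame latents are subsampled along the
    temporal axis to cut compute, while the first and last frames are kept
    as anchors for semantic and temporal coherence.

    latents: (B, C, T, H, W) video latent tensor.
    """
    for i, t in enumerate(scheduler.timesteps):
        if i == expert_switch_step:
            num_frames = latents.shape[2]
            # Keep every k-th frame latent, always retaining the final frame,
            # which carries the edited end state of the video trajectory.
            keep = sorted(set(range(0, num_frames, keep_every)) | {num_frames - 1})
            latents = latents[:, :, keep]
        # Diffusers-style UNet call and scheduler update (assumed interface).
        noise_pred = model(latents, t, encoder_hidden_states=prompt_emb).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```

Retaining the last frame is one plausible choice, since instruction-driven editing reads the edit result off the end of the generated trajectory; the paper may use a different compression rule.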
Similar Papers
Zero-Shot Video Translation and Editing with Frame Spatial-Temporal Correspondence
CV and Pattern Recognition
Makes videos look smooth and real, not jumpy.
REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
CV and Pattern Recognition
Makes AI better at changing pictures with words.
Unified Video Editing with Temporal Reasoner
CV and Pattern Recognition
Edits videos precisely without needing masks.