Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
By: Yiqing Shi, Yiren Song, Mike Zheng Shou
Potential Business Impact:
Helps computers understand depth, surface shape, and object cutouts in pictures.
Recent advances in diffusion transformers have shown remarkable generalization in visual synthesis, yet most dense perception methods still rely on text-to-image (T2I) generators designed for stochastic generation. We revisit this paradigm and show that image editing diffusion models are inherently image-to-image consistent, providing a more suitable foundation for dense perception tasks. We introduce Edit2Perceive, a unified diffusion framework that adapts editing models for depth estimation, surface normal estimation, and image matting. Built upon the FLUX.1 Kontext architecture, our approach employs full-parameter fine-tuning and a pixel-space consistency loss to enforce structure-preserving refinement across intermediate denoising states. Moreover, our single-step deterministic inference yields substantially faster runtime while training on relatively small datasets. Extensive experiments demonstrate comprehensive state-of-the-art results across all three tasks, revealing the strong potential of editing-oriented diffusion transformers for geometry-aware perception.
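To make the "pixel-space consistency loss" idea concrete, here is a minimal sketch of one plausible formulation: the single-step prediction is regressed against the ground-truth map, while decoded intermediate denoising states are pulled toward that prediction so refinement preserves structure instead of re-synthesizing it. This is an illustrative assumption, not the paper's exact objective; the function name `pixel_consistency_loss`, the L1 choice, and the `weight` parameter are hypothetical.

```python
import torch
import torch.nn.functional as F

def pixel_consistency_loss(pred_pixels, target_pixels, intermediate_pixels, weight=0.1):
    """Hypothetical pixel-space consistency objective (illustrative, not the paper's exact loss).

    pred_pixels:         final single-step prediction decoded to pixel space, e.g. a depth map
    target_pixels:       ground-truth dense map of the same shape
    intermediate_pixels: list of decoded intermediate denoising states
    """
    # Primary regression term on the final (single-step) prediction.
    task_loss = F.l1_loss(pred_pixels, target_pixels)

    # Consistency term: intermediate states should agree with the final output,
    # encouraging structure-preserving refinement across denoising steps.
    consistency = torch.stack(
        [F.l1_loss(x, pred_pixels.detach()) for x in intermediate_pixels]
    ).mean()

    return task_loss + weight * consistency

# Example usage with dummy data (shapes are illustrative):
pred = torch.rand(2, 1, 64, 64)                         # single-step depth prediction
target = torch.rand(2, 1, 64, 64)                       # ground-truth depth map
inter = [torch.rand(2, 1, 64, 64) for _ in range(3)]    # decoded intermediate states
loss = pixel_consistency_loss(pred, target, inter)
```

Detaching the final prediction in the consistency term keeps the gradient of that term focused on aligning intermediate states with the output, rather than dragging the output toward noisier intermediates; whether the actual method uses this stop-gradient is an assumption here.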
Similar Papers
3D-Consistent Multi-View Editing by Diffusion Guidance
CV and Pattern Recognition
Makes 3D pictures look right after editing.
From Editor to Dense Geometry Estimator
CV and Pattern Recognition
Makes computers understand 3D shapes from pictures.
Coupled Diffusion Sampling for Training-Free Multi-View Image Editing
CV and Pattern Recognition
Edits pictures from many angles, all matching.