Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval
By: Xin Wang, Haipeng Zhang, Mang Li, and more
Potential Business Impact:
Finds images using text and a picture.
Composed Image Retrieval (CIR) enables fine-grained visual search by combining a reference image with a textual modification. While supervised CIR methods achieve high accuracy, their reliance on costly triplet annotations motivates zero-shot solutions. The core challenge in zero-shot CIR (ZS-CIR) is a fundamental dilemma: existing text-centric and diffusion-based approaches struggle to bridge the vision-language modality gap effectively. To address this, we propose Fusion-Diff, a novel, data-efficient generative editing framework designed for multimodal alignment. First, it introduces a multimodal fusion feature-editing strategy within a joint vision-language (VL) space, substantially narrowing the modality gap. Second, to maximize data efficiency, the framework incorporates a lightweight Control-Adapter, achieving state-of-the-art performance after fine-tuning on a limited-scale synthetic dataset of only 200K samples. Extensive experiments on standard CIR benchmarks (CIRR, FashionIQ, and CIRCO) demonstrate that Fusion-Diff significantly outperforms prior zero-shot approaches. We further enhance the interpretability of the model by visualizing its fused multimodal representations.
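The abstract does not include implementation details, but the retrieval setup it builds on can be illustrated with a minimal sketch. The snippet below is not the Fusion-Diff method; it is a naive fusion baseline that embeds the reference image and the modification text with CLIP (via the Hugging Face transformers library) and averages the normalized embeddings into a single composed query for cosine-similarity ranking. The checkpoint name, file path, averaging fusion, and random gallery are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant with a shared embedding space works.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

ref_image = Image.open("reference.jpg")          # hypothetical reference image
modification = "the same dress but in red"        # hypothetical text modification

inputs = processor(text=[modification], images=ref_image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# L2-normalize both modalities so they live on the same unit hypersphere,
# then fuse by simple averaging (the baseline Fusion-Diff improves upon).
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
query = (img_emb + txt_emb) / 2
query = query / query.norm(dim=-1, keepdim=True)

# Stand-in gallery: in practice these are precomputed CLIP embeddings
# of the candidate images.
gallery = torch.randn(1000, query.shape[-1])
gallery = gallery / gallery.norm(dim=-1, keepdim=True)

scores = query @ gallery.T                        # cosine similarity
top5 = scores.topk(5).indices                     # indices of best matches
```

Per the abstract, Fusion-Diff replaces this fixed averaging with generative editing of the fused features in the joint VL space, but the retrieval interface, ranking gallery images by similarity to a composed query embedding, is the same.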
Similar Papers
From Mapping to Composing: A Two-Stage Framework for Zero-shot Composed Image Retrieval
CV and Pattern Recognition
Finds pictures using a picture and words.
Data-Efficient Generalization for Zero-shot Composed Image Retrieval
CV and Pattern Recognition
Finds pictures using text and other pictures.
Zero Shot Composed Image Retrieval
CV and Pattern Recognition
Finds exact clothing in pictures using text.