From Mapping to Composing: A Two-Stage Framework for Zero-shot Composed Image Retrieval

Published: April 25, 2025 | arXiv ID: 2504.17990v1

By: Yabing Wang, Zhuotao Tian, Qingpei Guo and more

Potential Business Impact:

Finds pictures using a picture and words.

Business Areas:

Visual Search Internet Services

Composed Image Retrieval (CIR) is a challenging multimodal task that retrieves a target image based on a reference image and accompanying modification text. Due to the high cost of annotating CIR triplet datasets, zero-shot (ZS) CIR has gained traction as a promising alternative. Existing studies mainly focus on projection-based methods, which map an image to a single pseudo-word token. However, these methods face three critical challenges: (1) insufficient representation capacity of the pseudo-word token, (2) discrepancies between the training and inference phases, and (3) reliance on large-scale synthetic data. To address these issues, we propose a two-stage framework in which training proceeds from mapping to composing. In the first stage, we enhance image-to-pseudo-word token learning by introducing a visual semantic injection module and a soft text alignment objective, enabling the token to capture richer, more fine-grained image information. In the second stage, we optimize the text encoder using a small amount of synthetic triplet data, enabling it to extract compositional semantics by combining pseudo-word tokens with modification text for accurate target image retrieval. The strong visual-to-pseudo mapping established in the first stage provides a solid foundation for the second stage, making our approach compatible with both high- and low-quality synthetic data and capable of achieving significant performance gains with only a small amount of synthetic data. Extensive experiments on three public datasets show superior performance compared to existing approaches.
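As a rough illustration of the projection-based idea the abstract describes (map the reference image to a pseudo-word token, combine it with the modification text, then rank candidate images), here is a toy sketch in plain Python. It is not the paper's method: the identity projection, the mean-pooling stand-in for a text encoder, the 3-dimensional embeddings, and every function and gallery name below are hypothetical choices made only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def map_to_pseudo_token(image_feat, weight):
    """Stage-1 stand-in: a linear projection of an image feature into the
    word-embedding space (hypothetical; the paper's mapping is richer)."""
    return [sum(w * x for w, x in zip(row, image_feat)) for row in weight]

def compose_query(pseudo_token, text_tokens):
    """Stage-2 stand-in: mean-pool the pseudo-word token with the
    modification-text token embeddings (a toy 'text encoder')."""
    tokens = [pseudo_token] + text_tokens
    dim = len(pseudo_token)
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]

def retrieve(query, gallery):
    """Rank gallery image names by cosine similarity to the composed query."""
    return sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)

# Toy 3-d embedding space (assumed axes: "dog", "red collar", "cat").
reference_image = [1.0, 0.0, 0.0]            # a plain dog
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]                 # hypothetical projection weights
modification_text = [[0.0, 1.0, 0.0]]        # "add a red collar"

gallery = {
    "dog_plain": [1.0, 0.0, 0.0],
    "dog_red_collar": [1.0, 1.0, 0.0],
    "cat_red_collar": [0.0, 1.0, 1.0],
}

pseudo = map_to_pseudo_token(reference_image, identity)
query = compose_query(pseudo, modification_text)
ranking = retrieve(query, gallery)
print(ranking)
```

In this toy setup the composed query sits between the reference image and the modification text, so the candidate that matches both ranks first. A real system would replace the stand-ins with learned vision and text encoders.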

Scaling Prompt Instructed Zero Shot Composed Image Retrieval with Image-Only Data

CV and Pattern Recognition

Lets computers find images using text changes.

1 Apr 2025 2

93%

Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval

CV and Pattern Recognition

Find images by showing one and describing changes.

25 Mar 2025 1

93%

Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval

CV and Pattern Recognition

Find images using text and a picture.

1 Dec 2025 1

Page Count

10 pages
