Soft Filtering: Guiding Zero-shot Composed Image Retrieval with Prescriptive and Proscriptive Constraints

Published: December 23, 2025 | arXiv ID: 2512.20781v1

By: Youjin Jung, Seongwoo Cho, Hyun-seok Min and more

Composed Image Retrieval (CIR) aims to find a target image that aligns with user intent, expressed through a reference image and a modification text. While Zero-shot CIR (ZS-CIR) methods sidestep the need for labeled training data by leveraging pretrained vision-language models, they often rely on a single fused query that merges all descriptive cues of what the user wants, which tends to dilute key information and fails to account for what the user wishes to avoid. Moreover, current CIR benchmarks assume a single correct target per query, overlooking the ambiguity in modification texts. To address these challenges, we propose Soft Filtering with Textual constraints (SoFT), a training-free, plug-and-play filtering module for ZS-CIR. SoFT leverages multimodal large language models (LLMs) to extract two complementary constraints from the reference-modification pair: prescriptive (must-have) and proscriptive (must-avoid) constraints. These serve as semantic filters that reward or penalize candidate images to re-rank results, without modifying the base retrieval model or adding supervision. In addition, we construct a two-stage dataset pipeline that refines CIR benchmarks. We first identify multiple plausible targets per query to construct multi-target triplets, capturing the open-ended nature of user intent. We then guide multimodal LLMs to rewrite the modification text to focus on one target, while referencing contrastive distractors to ensure precision. This enables more comprehensive and reliable evaluation under varying ambiguity levels. Applied on top of CIReVL, a ZS-CIR retriever, SoFT raises R@5 to 65.25 on CIRR (+12.94), mAP@50 to 27.93 on CIRCO (+6.13), and R@50 to 58.44 on FashionIQ (+4.59), demonstrating broad effectiveness.
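The reward-and-penalize re-ranking idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the scoring rule, so everything here is an assumption made for illustration: the additive form of the adjustment, the function name `soft_filter_rerank`, the `reward_weight` and `penalty_weight` parameters, and the toy scores. In practice the three score lists would come from a vision-language model comparing each candidate image with the fused query, the prescriptive constraint, and the proscriptive constraint.

```python
from typing import List, Sequence


def soft_filter_rerank(
    base_scores: Sequence[float],
    prescriptive_scores: Sequence[float],
    proscriptive_scores: Sequence[float],
    reward_weight: float = 1.0,
    penalty_weight: float = 1.0,
) -> List[int]:
    """Return candidate indices sorted by adjusted score, best first.

    ASSUMPTION: the adjustment is a simple weighted sum. The paper's
    actual scoring rule is not stated in the abstract.
    """
    if not (len(base_scores) == len(prescriptive_scores) == len(proscriptive_scores)):
        raise ValueError("all score lists must have one entry per candidate")

    adjusted = [
        # Reward similarity to the must-have text, penalize similarity
        # to the must-avoid text, and keep the base retriever's score.
        base + reward_weight * must_have - penalty_weight * must_avoid
        for base, must_have, must_avoid in zip(
            base_scores, prescriptive_scores, proscriptive_scores
        )
    ]
    return sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)


# Toy example with made-up scores for three candidate images.
base = [0.80, 0.78, 0.60]        # base retriever similarity
must_have = [0.20, 0.70, 0.65]   # similarity to the prescriptive constraint
must_avoid = [0.60, 0.10, 0.05]  # similarity to the proscriptive constraint

ranking = soft_filter_rerank(base, must_have, must_avoid)
print(ranking)
```

In this toy case the candidate that the base retriever ranks first drops to last because it matches the must-avoid text strongly, while the base model itself is left untouched. Setting both weights to zero recovers the base ranking.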

Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval

CV and Pattern Recognition

Find images using text and a picture.

1 Dec 2025 1

91%

From Mapping to Composing: A Two-Stage Framework for Zero-shot Composed Image Retrieval

CV and Pattern Recognition

Finds pictures using a picture and words.

25 Apr 2025 0

90%

Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval

CV and Pattern Recognition

Find images by showing one and describing changes.

25 Mar 2025 1
