Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval
By: Yifan Wang, Tao Wang, Chenwei Tang, and more
Potential Business Impact:
Finds exact pictures for text descriptions.
Recently, prompt learning has demonstrated remarkable success in adapting pre-trained Vision-Language Models (VLMs) to various downstream tasks such as image classification. However, its application to the downstream Image-Text Retrieval (ITR) task is more challenging. We find that the challenge lies in discriminating both fine-grained attributes and similar subcategories of the downstream data. To address this challenge, we propose Dual prompt Learning with Joint Category-Attribute Reweighting (DCAR), a novel dual-prompt learning framework for precise image-text matching. The framework dynamically adjusts prompt vectors along both semantic and visual dimensions to improve the performance of CLIP on the downstream ITR task. Based on the prompt paradigm, DCAR jointly optimizes attribute and class features to enhance fine-grained representation learning. Specifically, (1) at the attribute level, it dynamically updates the weights of attribute descriptions based on text-image mutual information correlation; (2) at the category level, it introduces negative samples from multiple perspectives with category-matching weighting to learn subcategory distinctions. To validate our method, we construct the Fine-class Described Retrieval Dataset (FDRD), which serves as a challenging benchmark for ITR in downstream data domains. It covers over 1,500 downstream fine categories and 230,000 image-caption pairs with detailed attribute annotations. Extensive experiments on FDRD demonstrate that DCAR achieves state-of-the-art performance, surpassing existing baselines.
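The abstract only sketches the two mechanisms. Below is a minimal, illustrative PyTorch sketch (not the authors' implementation) of how attribute reweighting and category-weighted negatives could be realized on top of CLIP-style embeddings; all function names, dimensions, and the similarity-based weighting are assumptions standing in for details the abstract does not give.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes; the paper's actual settings are not stated in the abstract.
EMBED_DIM = 512
N_ATTRS = 4      # attribute descriptions per category
N_CLASSES = 16   # fine-grained subcategories

def attribute_reweighting(image_emb, attr_embs):
    """Weight each attribute description by its similarity to the image
    (a stand-in for the abstract's text-image mutual information
    correlation), then pool into one text embedding per category."""
    # image_emb: (B, D), attr_embs: (C, A, D); both L2-normalized.
    sims = torch.einsum("bd,cad->bca", image_emb, attr_embs)      # (B, C, A)
    weights = sims.softmax(dim=-1)                                 # per-class attribute weights
    class_text = torch.einsum("bca,cad->bcd", weights, attr_embs)  # (B, C, D)
    return F.normalize(class_text, dim=-1)

def category_weighted_loss(image_emb, class_text, targets, temperature=0.07):
    """Weighted InfoNCE: negative subcategories that are already similar to
    the image contribute more to the denominator, a stand-in for the
    paper's category-matching weighting of negatives."""
    logits = torch.einsum("bd,bcd->bc", image_emb, class_text) / temperature  # (B, C)
    pos = logits.gather(1, targets[:, None]).squeeze(1)                       # positive logits
    with torch.no_grad():
        # Up-weight negatives in proportion to how confusable they are.
        weights = 1.0 + logits.softmax(dim=-1)
        weights.scatter_(1, targets[:, None], 1.0)  # positives keep weight 1
    denom = (weights * logits.exp()).sum(dim=-1)
    return (denom.log() - pos).mean()

if __name__ == "__main__":
    # Random stand-ins for CLIP image embeddings and attribute-text embeddings.
    B = 8
    image_emb = F.normalize(torch.randn(B, EMBED_DIM), dim=-1)
    attr_embs = F.normalize(torch.randn(N_CLASSES, N_ATTRS, EMBED_DIM), dim=-1)
    targets = torch.randint(0, N_CLASSES, (B,))
    class_text = attribute_reweighting(image_emb, attr_embs)
    loss = category_weighted_loss(image_emb, class_text, targets)
    print(loss.item())
```

In an actual prompt-learning setup, the attribute and category texts would be encoded with CLIP's text encoder using learnable prompt vectors, and the loss would be backpropagated into those prompts rather than into the encoders themselves.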
Similar Papers
LLM-empowered Dynamic Prompt Routing for Vision-Language Models Tuning under Long-Tailed Distributions
CV and Pattern Recognition
Fixes computer vision bias in messy data.
AttriPrompt: Dynamic Prompt Composition Learning for CLIP
CV and Pattern Recognition
Helps computers understand pictures better.
Toward Robust Medical Fairness: Debiased Dual-Modal Alignment via Text-Guided Attribute-Disentangled Prompt Learning for Vision-Language Models
CV and Pattern Recognition
Makes medical AI fair for everyone.