Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition
By: Wei Tang, Zuo-Zheng Wang, Kun Zhang, and more
Potential Business Impact:
Helps computers recognize many things in one picture, even rare ones.
Long-tailed multi-label visual recognition poses a significant challenge, as images typically contain multiple labels with highly imbalanced class distributions, leading to biased models that favor head classes while underperforming on tail classes. Recent efforts have leveraged pre-trained vision-language models, such as CLIP, alongside long-tailed learning techniques to exploit rich visual-textual priors for improved performance. However, existing methods often derive semantic inter-class relationships directly from imbalanced datasets, resulting in unreliable correlations for tail classes due to data scarcity. Moreover, CLIP's zero-shot paradigm is optimized for single-label image-text matching, making it suboptimal for multi-label tasks. To address these issues, we propose the correlation adaptation prompt network (CAPNET), a novel end-to-end framework that explicitly models label correlations from CLIP's textual encoder. The framework incorporates a graph convolutional network for label-aware propagation and learnable soft prompts for refined embeddings. It utilizes a distribution-balanced focal loss with class-aware re-weighting for optimized training under imbalance. Furthermore, it improves generalization through test-time ensembling and realigns visual-textual modalities using parameter-efficient fine-tuning to avert overfitting on tail classes without compromising head-class performance. Extensive experiments and ablation studies on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE demonstrate that CAPNET achieves substantial improvements over state-of-the-art methods, validating its effectiveness for real-world long-tailed multi-label visual recognition.
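The abstract names two concrete mechanisms: label correlations propagated from CLIP's text embeddings via a graph convolutional network, and a distribution-balanced focal loss with class-aware re-weighting. The PyTorch sketch below illustrates only those two ideas in minimal form. The class and function names (LabelCorrelationGCN, db_focal_loss), the single propagation step, the simplified loss, and all hyperparameters are assumptions for illustration, not the paper's actual CAPNET implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LabelCorrelationGCN(nn.Module):
    """One graph-convolution step over label (text) embeddings (sketch)."""

    def __init__(self, dim, num_classes):
        super().__init__()
        # Learnable adjacency; identity init is a placeholder for a
        # correlation matrix derived from the label text embeddings.
        self.adj = nn.Parameter(torch.eye(num_classes))
        self.fc = nn.Linear(dim, dim)

    def forward(self, label_emb):                 # (C, D) label embeddings
        a = F.softmax(self.adj, dim=-1)           # row-normalized adjacency
        return F.relu(self.fc(a @ label_emb))     # correlation-aware embeddings


def db_focal_loss(logits, targets, class_freq, gamma=2.0):
    """Simplified distribution-balanced focal loss.

    Classes are re-weighted by inverse frequency so tail classes count
    more, and the focal term down-weights easy examples.
    """
    weight = (1.0 / class_freq.clamp(min=1)).unsqueeze(0)   # (1, C)
    weight = weight / weight.mean()                          # normalize weights
    prob = torch.sigmoid(logits)
    pt = torch.where(targets > 0, prob, 1 - prob)            # prob of true label
    focal = (1 - pt) ** gamma
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (weight * focal * bce).mean()


if __name__ == "__main__":
    # Toy usage: 4 classes, CLIP-like 512-d features, batch of 8 images.
    C, D, B = 4, 512, 8
    text_emb = torch.randn(C, D)          # stand-in for CLIP text/prompt embeddings
    img_feat = torch.randn(B, D)          # stand-in for CLIP image features

    refined = LabelCorrelationGCN(D, C)(text_emb)
    logits = img_feat @ refined.t()       # per-label matching scores

    targets = torch.randint(0, 2, (B, C)).float()
    class_freq = torch.tensor([100.0, 40.0, 5.0, 1.0])   # imbalanced counts
    print(db_focal_loss(logits, targets, class_freq).item())
```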
Similar Papers
CAPT: Class-Aware Prompt Tuning for Federated Long-Tailed Learning with Vision-Language Model
Machine Learning (CS)
Teaches computers to learn from messy, uneven data.
uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data
CV and Pattern Recognition
Helps computers understand pictures in many languages.
Enhancing CLIP Robustness via Cross-Modality Alignment
CV and Pattern Recognition
Protects AI from tricky fake pictures.