HarmoCLIP: Harmonizing Global and Regional Representations in Contrastive Vision-Language Models

Published: November 27, 2025 | arXiv ID: 2511.22594v1

By: Haoxi Zeng, Haoxuan Li, Yi Bin, and more

Potential Business Impact:

Helps computers understand pictures and words better.

Business Areas:

Image Recognition, Data and Analytics, Software

Contrastive Language-Image Pre-training (CLIP) has demonstrated remarkable generalization ability and strong performance across a wide range of vision-language tasks. However, because it lacks region-level supervision, CLIP exhibits limited fine-grained semantic understanding. Although several methods attempt to mitigate this issue, they unintentionally disrupt the global alignment, resulting in a persistent trade-off in which improving local perception degrades global coherence. In this paper, we propose HarmoCLIP, a novel framework designed to harmonize global and regional representations within CLIP. We first identify the absence of direct alignment between local textual and visual semantics as the fundamental cause of the trade-off. To address this, HarmoCLIP introduces an explicit fine-grained semantic supervision term that directly aligns textual segments with their corresponding visual regions, effectively bridging the image region space and the textual space. To further strengthen representation capability at the local level, our method introduces a novel Region-Language Alignment supervision strategy that promotes fine-grained semantic learning without compromising global semantic consistency. Extensive experiments demonstrate that HarmoCLIP achieves state-of-the-art performance on the global task of retrieval (improvements of up to 69.78%) and yields a substantial 3.2% improvement in Top-1 accuracy on the region task of bounding-box classification, consistently outperforming prior approaches while providing a balanced, efficient, and plug-and-play solution to the global-local trade-off in CLIP. Code is available at https://github.com/Erosist/HarmoCLIP.
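The abstract gives no formulas, so the following is only a minimal sketch of the general idea it describes: a standard CLIP-style contrastive loss over whole images and captions, plus an extra term of the same form over matched regions and text segments. It is not the authors' implementation. Every name here (`info_nce`, `symmetric_contrastive`, `combined_loss`, `region_weight`) is hypothetical, the temperature of 0.07 is an assumed value, and the simple weighted sum is an assumption about how the two terms might be combined.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy in which queries[i] should match keys[i]."""
    queries = [normalize(q) for q in queries]
    keys = [normalize(k) for k in keys]
    total = 0.0
    for i, q in enumerate(queries):
        # Cosine similarity to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


def symmetric_contrastive(a, b, temperature=0.07):
    """Average of the a-to-b and b-to-a losses, as in CLIP-style training."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def combined_loss(image_emb, caption_emb, region_emb, segment_emb,
                  region_weight=1.0):
    """Global image-caption term plus a weighted region-segment term.

    The weighted sum is an illustrative assumption, not the paper's loss.
    """
    global_term = symmetric_contrastive(image_emb, caption_emb)
    region_term = symmetric_contrastive(region_emb, segment_emb)
    return global_term + region_weight * region_term
```

In this sketch, row i of `region_emb` is assumed to be paired with row i of `segment_emb`; how regions and text segments are actually extracted and matched is not specified in the abstract.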

HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment

CV and Pattern Recognition

Helps computers understand detailed picture descriptions better.

10 Nov 2025

MulCLIP: A Multi-level Alignment Framework for Enhancing Fine-grained Long-context CLIP

CV and Pattern Recognition

Helps computers understand pictures and long stories.

8 Dec 2025

PowerCLIP: Powerset Alignment for Contrastive Pre-Training

CV and Pattern Recognition

Teaches computers to understand complex picture meanings.

28 Nov 2025

Page Count

13 pages
