CalibCLIP: Contextual Calibration of Dominant Semantics for Text-Driven Image Retrieval
By: Bin Kang, Bin Chen, Junjie Wang, and more
Potential Business Impact:
Finds better pictures from text searches.
Existing Visual Language Models (VLMs) suffer from structural limitations in which a few low-contribution tokens may excessively capture global semantics, dominating the information aggregation process and suppressing discriminative features in text-driven image retrieval tasks. To address this, we introduce CalibCLIP, a training-free method designed to calibrate the suppressive effect of dominant tokens. Specifically, in the visual space, we propose the Contrastive Visual Enhancer (CVE), which decouples visual features into target and low-information regions, then identifies dominant tokens and dynamically suppresses their representations. In the textual space, we introduce the Discriminative Concept Calibrator (DCC), which differentiates between general and discriminative concepts within the text query. By mitigating the challenges posed by generic concepts and improving the representations of discriminative concepts, DCC strengthens the differentiation among similar samples. Finally, extensive experiments demonstrate consistent improvements across seven benchmarks spanning three image retrieval tasks, underscoring the effectiveness of CalibCLIP. Code is available at: https://github.com/kangbin98/CalibCLIP
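To make the idea of "calibrating dominant tokens" concrete, below is a minimal, hedged sketch of one way such a suppression step could look in practice. It is not the authors' CVE implementation; the function name suppress_dominant_tokens, the top_k and alpha parameters, and the use of cosine similarity to flag dominant patch tokens are all illustrative assumptions, shown here only to convey the general mechanism of down-weighting tokens that dominate the pooled visual representation.

```python
# Minimal sketch (not the authors' implementation): down-weight visual tokens that
# dominate the global representation before re-pooling the image embedding.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn.functional as F


def suppress_dominant_tokens(patch_feats: torch.Tensor,
                             cls_feat: torch.Tensor,
                             top_k: int = 8,
                             alpha: float = 0.5) -> torch.Tensor:
    """Re-pool patch tokens after suppressing the most dominant ones.

    patch_feats: (N, D) patch token embeddings from a CLIP-style visual encoder.
    cls_feat:    (D,)   global [CLS] embedding.
    top_k:       number of most CLS-aligned tokens treated as "dominant".
    alpha:       suppression factor applied to dominant tokens.
    """
    # Similarity of each patch token to the global embedding.
    sims = F.cosine_similarity(patch_feats, cls_feat.unsqueeze(0), dim=-1)  # (N,)

    # Tokens most aligned with the global embedding are treated as dominant.
    dominant_idx = sims.topk(min(top_k, sims.numel())).indices

    # Per-token weights: 1 everywhere, reduced to alpha for dominant tokens.
    weights = torch.ones_like(sims)
    weights[dominant_idx] = alpha
    weights = weights / weights.sum()

    # Re-pool the image embedding with the calibrated weights.
    calibrated = (weights.unsqueeze(-1) * patch_feats).sum(dim=0)  # (D,)
    return F.normalize(calibrated, dim=-1)


if __name__ == "__main__":
    # Random features standing in for CLIP ViT-B/16 outputs (196 patches, dim 512).
    patches = torch.randn(196, 512)
    cls_tok = torch.randn(512)
    img_embed = suppress_dominant_tokens(patches, cls_tok)
    print(img_embed.shape)  # torch.Size([512])
```

A training-free calibration of this kind only re-weights existing token features at inference time, which is why no fine-tuning of the VLM is required.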
Similar Papers
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
CV and Pattern Recognition
Helps computers see and understand any object.
Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection
CV and Pattern Recognition
Helps computers understand pictures better by focusing on important parts.
Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception
CV and Pattern Recognition
Helps computers see and understand objects better.