CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision
By: Behnam Raoufi, Hossein Sharify, Mohamad Mahdee Ramezanee, and more
Potential Business Impact:
Helps computers see and name objects better.
Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple, detector-agnostic framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via an InfoNCE contrastive loss and an auxiliary cross-entropy term, while all standard detection losses are optimized simultaneously. The approach applies seamlessly to both two-stage and one-stage architectures. We validate it on Pascal VOC 2007+2012 with Faster R-CNN and on the large-scale MS COCO 2017 benchmark with modern YOLO detectors (YOLOv11), achieving consistent and substantial improvements while preserving real-time inference speed. Extensive experiments and ablations show that joint optimization with learnable text embeddings markedly enhances closed-set detection performance across diverse architectures and datasets.
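To make the abstract's description concrete, below is a minimal PyTorch sketch of what such a parallel contrastive head could look like. It is not the authors' implementation: the class name ContrastiveClassHead, the auxiliary linear classifier, the loss weights, and the embedding dimension of 512 are all illustrative assumptions; only the overall idea (project features into a CLIP-sized space, score them against learnable per-class text embeddings with an InfoNCE-style loss plus an auxiliary cross-entropy term, and add this to the standard detection losses) follows the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveClassHead(nn.Module):
    """Hypothetical parallel head: projects region/grid features into a
    CLIP-sized embedding space and scores them against learnable
    class-specific text embeddings (all sizes/names are assumptions)."""
    def __init__(self, feat_dim, embed_dim=512, num_classes=80, temperature=0.07):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)      # region/grid feature -> CLIP space
        # Learnable class text embeddings; in practice these might be
        # initialized from a CLIP text encoder run on the class names.
        self.text_embeds = nn.Parameter(torch.randn(num_classes, embed_dim) * 0.02)
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1.0 / temperature)))
        # Plain linear classifier used here for the auxiliary cross-entropy term.
        self.aux_cls = nn.Linear(feat_dim, num_classes)

    def forward(self, region_feats):
        v = F.normalize(self.proj(region_feats), dim=-1)   # (N, D) image-side embeddings
        t = F.normalize(self.text_embeds, dim=-1)          # (C, D) text-side embeddings
        sim_logits = self.logit_scale.exp() * v @ t.t()    # (N, C) similarity logits
        aux_logits = self.aux_cls(region_feats)            # (N, C) conventional logits
        return sim_logits, aux_logits

def contrastive_head_loss(sim_logits, aux_logits, labels, w_nce=1.0, w_ce=0.5):
    """With one learnable text embedding per class serving as the positive,
    the image-to-text InfoNCE reduces to softmax cross-entropy over the
    similarity logits; the auxiliary term is ordinary cross-entropy.
    The weights w_nce and w_ce are illustrative, not from the paper."""
    loss_nce = F.cross_entropy(sim_logits, labels)
    loss_ce = F.cross_entropy(aux_logits, labels)
    return w_nce * loss_nce + w_ce * loss_ce
```

In joint training, this head's loss would simply be added to the detector's usual objectives (e.g. total_loss = det_box_loss + det_obj_loss + det_cls_loss + contrastive_head_loss(...)), so the backbone, detection heads, and text embeddings are optimized simultaneously; at inference the head is lightweight enough to preserve real-time speed.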
Similar Papers
SuperCLIP: CLIP with Simple Classification Supervision
CV and Pattern Recognition
Makes computers understand pictures and words better.
uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data
CV and Pattern Recognition
Helps computers understand pictures in many languages.
Contrastive vision-language learning with paraphrasing and negation
CV and Pattern Recognition
Teaches computers to understand words that change meaning.