WeDetect: Fast Open-Vocabulary Object Detection as Retrieval
By: Shenghao Fu, Yukun Su, Fengyun Rao and more
Potential Business Impact:
Finds any object in pictures using words.
Open-vocabulary object detection aims to detect arbitrary classes via text prompts. Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem, i.e., matching regions to text queries in a shared embedding space. In this work, we fully explore this retrieval philosophy and demonstrate its unique advantages in efficiency and versatility through a model family named WeDetect: (1) State-of-the-art performance. WeDetect is a real-time detector with a dual-tower architecture. We show that, with well-curated data and full training, the non-fusion WeDetect surpasses other fusion models and establishes a strong open-vocabulary foundation. (2) Fast backtrack of historical data. WeDetect-Uni is a universal proposal generator based on WeDetect. We freeze the entire detector and only finetune an objectness prompt to retrieve generic object proposals across categories. Importantly, the proposal embeddings are class-specific and enable a new application, object retrieval, which supports retrieving objects from historical data. (3) Integration with LMMs for referring expression comprehension (REC). We further propose WeDetect-Ref, an LMM-based object classifier to handle complex referring expressions, which retrieves target objects from the proposal list extracted by WeDetect-Uni. It discards next-token prediction and classifies objects in a single forward pass. Together, the WeDetect family unifies detection, proposal generation, object retrieval, and REC under a coherent retrieval framework, achieving state-of-the-art performance across 15 benchmarks with high inference efficiency.
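The retrieval formulation described in the abstract, matching region embeddings to text-query embeddings in a shared space, can be illustrated with a minimal sketch. This is not the WeDetect implementation: the function names, the toy 3-dimensional vectors, the use of cosine similarity, and the score threshold are all assumptions made for illustration, and a real system would obtain the embeddings from the detector's visual and text towers.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_scores(region_embs, text_embs):
    """Return a regions-by-queries matrix of cosine similarities."""
    regions = [l2_normalize(r) for r in region_embs]
    texts = [l2_normalize(t) for t in text_embs]
    return [[sum(a * b for a, b in zip(r, t)) for t in texts] for r in regions]

def classify_regions(region_embs, text_embs, class_names, threshold=0.5):
    """Assign each region its best-matching text query, if the score is high enough.

    The threshold is a hypothetical hyperparameter, not a value from the paper.
    """
    results = []
    for i, row in enumerate(cosine_scores(region_embs, text_embs)):
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] >= threshold:
            results.append((i, class_names[best], row[best]))
    return results

# Toy stand-ins for embeddings produced by the two towers (assumed values).
regions = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.3, 0.3, 0.3]]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
names = ["cat", "dog"]

detections = classify_regions(regions, texts, names, threshold=0.8)
for index, name, score in detections:
    print(f"region {index}: {name} ({score:.3f})")
```

Because the text embeddings are computed independently of the image, they can be cached and reused across images, which is one reason a non-fusion design of this kind can be fast at inference time. In this toy example the third region matches neither query strongly and is dropped by the threshold.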
Similar Papers
Towards 3D Objectness Learning in an Open World
CV and Pattern Recognition
Finds any object in 3D, even new ones.
OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations
CV and Pattern Recognition
Finds objects in 3D rooms without human labels.
Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting
CV and Pattern Recognition
Teaches computers to find new things in pictures.