CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection

Published: October 16, 2025 | arXiv ID: 2510.14792v1

By: Hojun Choi, Youngsun Lim, Jaeyo Shin and more

Potential Business Impact:

Helps computers find any object, even hidden ones.

Business Areas:

Image Recognition, Data and Analytics, Software

Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching and neglect the intermediate reasoning steps essential for interpreting semantically complex scenes. This limits their robustness in crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that integrates structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception, even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL), which uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4%, respectively, over the best prior method. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art.
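
The abstract does not give the CBL objective itself, so the following is only a minimal sketch of the general idea of using background cues as contrastive negatives. It swaps in a generic InfoNCE-style loss, which is an assumption and not necessarily the paper's formulation. The function names, the `temperature` value, and the toy two-dimensional embeddings are all hypothetical and chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def background_contrastive_loss(region, positive, backgrounds, temperature=0.1):
    """Generic InfoNCE-style loss (assumed form, not taken from the paper).

    Pulls the `region` embedding toward its `positive` (e.g. the category
    text embedding) and pushes it away from every embedding in
    `backgrounds`, which play the role of negatives.
    """
    pos_logit = cosine(region, positive) / temperature
    logits = [pos_logit] + [cosine(region, b) / temperature for b in backgrounds]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Equals -log(softmax probability assigned to the positive).
    return log_denominator - pos_logit


# Hypothetical toy embeddings: one object region and its category embedding.
region = [1.0, 0.1]
category = [0.9, 0.2]

# Background cues that are already well separated from the region ...
far_backgrounds = [[-1.0, 0.3], [0.0, -1.0]]
# ... versus background cues that are entangled with it.
near_backgrounds = [[1.0, 0.15], [0.95, 0.1]]

loss_far = background_contrastive_loss(region, category, far_backgrounds)
loss_near = background_contrastive_loss(region, category, near_backgrounds)
print(f"separated background: {loss_far:.4f}")
print(f"entangled background: {loss_near:.4f}")
```

In this toy setup the loss is larger when the background embeddings lie close to the region embedding, which is the pressure toward object-background feature disentanglement that the abstract attributes to CBL.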

CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks

CV and Pattern Recognition

Helps AI see and count objects better.

7 Dec 2025 0

90%

Latent Chain-of-Thought for Visual Reasoning

Artificial Intelligence

Makes AI think step-by-step better for new problems.

27 Oct 2025 2

90%

Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

CV and Pattern Recognition

Lets computers see and understand pictures better.

24 Nov 2025 0

Country of Origin

🇰🇷 🇺🇸 United States, Korea, Republic of

Page Count

28 pages
