CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks
By: Yu Qi, Yumeng Zhang, Chenting Gong, and more
Potential Business Impact:
Helps AI see and count objects better.
Large Vision-Language Models (LVLMs) have demonstrated remarkable success in a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks -- such as object detection, semantic segmentation, and depth estimation -- remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19% mAP on COCO2017 val, struggling in particular with dense scenes and small-object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple but efficient strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding -- each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves perception performance without compromising general vision-language capabilities. With a standard Qwen2.5-VL-7B-Instruct, CoT4Det boosts mAP from 19.0% to 33.0% on COCO2017 val and achieves competitive results across a variety of perception benchmarks, outperforming baselines by +2% on the RefCOCO series and +19% on Flickr30k Entities.
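The three-step decomposition can be illustrated with a short sketch. The code below is a minimal illustration, not the authors' implementation: the lvlm.chat(image, prompt) client and the exact prompt wording are assumptions. It only shows how a single detection query can be split into separate classification, counting, and grounding calls to an LVLM.

# Hedged sketch of a three-step chain-of-thought detection loop.
# Assumes a hypothetical chat-style client: lvlm.chat(image, prompt) -> str.
import json

def cot4det_style_detect(lvlm, image):
    # Step 1: classification -- ask which object categories are present.
    categories_text = lvlm.chat(
        image,
        "List every distinct object category visible in this image, "
        "as a comma-separated list."
    )
    categories = [c.strip() for c in categories_text.split(",") if c.strip()]

    detections = []
    for category in categories:
        # Step 2: counting -- ask how many instances of this category appear.
        count_text = lvlm.chat(
            image,
            f"How many separate instances of '{category}' are in this image? "
            "Answer with a single integer."
        )
        try:
            count = int(count_text.strip())
        except ValueError:
            count = 1  # fall back if the model does not return a clean integer

        # Step 3: grounding -- ask for a bounding box per counted instance.
        boxes_text = lvlm.chat(
            image,
            f"Give the bounding boxes of all {count} '{category}' instances "
            "as a JSON list of [x1, y1, x2, y2] in pixel coordinates."
        )
        try:
            boxes = json.loads(boxes_text)
        except json.JSONDecodeError:
            boxes = []

        detections.extend({"category": category, "bbox": box} for box in boxes)

    return detections

Each step targets a capability LVLMs already handle well (naming categories, counting, and grounding a known, counted set), rather than asking the model to emit all boxes for a dense scene in one shot.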
Similar Papers
CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection
CV and Pattern Recognition
Helps computers find any object, even hidden ones.
CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving
CV and Pattern Recognition
Helps self-driving cars think step-by-step to drive safely.
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
CV and Pattern Recognition
Lets computers see and understand pictures better.