Score: 2

Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation

Published: September 17, 2025 | arXiv ID: 2509.13939v1

By: Gia Khanh Nguyen, Yifeng Huang, Minh Hoai

Potential Business Impact:

Helps computer vision systems count only the specific objects a user asks about, rather than everything visible in an image.

Business Areas:
Image Recognition, Data and Analytics, Software

Visual counting is a fundamental yet challenging task, especially when users need to count objects of a specific type in complex scenes. While recent models, including class-agnostic counting models and large vision-language models (VLMs), show promise in counting tasks, their ability to perform fine-grained, intent-driven counting remains unclear. In this paper, we introduce PairTally, a benchmark dataset specifically designed to evaluate fine-grained visual counting. Each of the 681 high-resolution images in PairTally contains two object categories, requiring models to distinguish and count based on subtle differences in shape, size, color, or semantics. The dataset includes both inter-category (distinct categories) and intra-category (closely related subcategories) settings, making it suitable for rigorous evaluation of selective counting capabilities. We benchmark a variety of state-of-the-art models, including exemplar-based methods, language-prompted models, and large VLMs. Our results show that despite recent advances, current models struggle to reliably count what users intend, especially in fine-grained and visually ambiguous cases. PairTally provides a new foundation for diagnosing and improving fine-grained visual counting systems.
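
To make the evaluation setup concrete, the sketch below shows how a paired-category benchmark like PairTally might be scored. The annotation schema (`category_a`, `category_b`, per-category ground-truth counts) and the `model.count(...)` interface are hypothetical placeholders, not from the paper; MAE and RMSE are standard counting metrics, though the paper's exact protocol may differ.

```python
import json
import math

def evaluate_paired_counting(annotations_path: str, model) -> dict:
    """Score a counting model on images that each contain two categories.

    The model is prompted for one category at a time; systematic
    over-counting often indicates the model also counted the
    distractor category.
    NOTE: the annotation schema and model interface are assumed for
    illustration; consult the PairTally release for the real format.
    """
    with open(annotations_path) as f:
        annotations = json.load(f)

    abs_errors, sq_errors = [], []
    for ann in annotations:
        for cat_key in ("category_a", "category_b"):
            category = ann[cat_key]["name"]
            gt_count = ann[cat_key]["count"]
            # Hypothetical interface: ask the model to count only the
            # intended category in this image.
            pred = model.count(image_path=ann["image"], prompt=category)
            abs_errors.append(abs(pred - gt_count))
            sq_errors.append((pred - gt_count) ** 2)

    n = len(abs_errors)
    return {
        "MAE": sum(abs_errors) / n,
        "RMSE": math.sqrt(sum(sq_errors) / n),
    }
```

Scoring each image twice, once per category, is what separates selective counting from ordinary counting: a model that simply counts all salient objects will score well on one category and badly on the other.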

Repos / Data Links

Page Count
8 pages

Category
Computer Science:
Computer Vision and Pattern Recognition