RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios

Published: December 31, 2025 | arXiv ID: 2512.24561v1

By: Tianyi Zhao, Jiawen Xi, Linhui Xiao and more

Potential Business Impact:

Helps computers find things in pictures, even at night.

Business Areas:

Visual Search Internet Services

Visual Grounding (VG) aims to localize specific objects in an image according to natural language expressions, serving as a fundamental task in vision-language understanding. However, existing VG benchmarks are mostly derived from datasets collected under clean environments, such as COCO, where scene diversity is limited. Consequently, they fail to reflect the complexity of real-world conditions, such as changes in illumination and weather, that are critical to evaluating model robustness and generalization in safety-critical applications. To address these limitations, we present RGBT-Ground, the first large-scale visual grounding benchmark built for complex real-world scenarios. It consists of spatially aligned RGB and thermal infrared (TIR) image pairs with high-quality referring expressions, corresponding object bounding boxes, and fine-grained annotations at the scene, environment, and object levels. This benchmark enables comprehensive evaluation and facilitates the study of robust grounding under diverse and challenging conditions. Furthermore, we establish a unified visual grounding framework that supports both uni-modal (RGB or TIR) and multi-modal (RGB-TIR) visual inputs. Based on it, we propose RGBT-VGNet, a simple yet effective baseline that fuses complementary visual modalities to achieve robust grounding. We extensively adapt existing methods to RGBT-Ground. Experimental results show that our proposed RGBT-VGNet significantly outperforms these adapted methods, particularly in nighttime and long-distance scenarios. All resources will be publicly released to promote future research on robust visual grounding in complex real-world environments.
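
As a rough illustration of how a benchmark of this kind could be consumed, the sketch below represents one sample as an aligned RGB/TIR pair with a referring expression, a bounding box, and attribute tags, and scores predictions by box overlap. The `GroundingSample` fields, the `(x1, y1, x2, y2)` box convention, the tag names, and the IoU threshold of 0.5 are all assumptions made for illustration; the abstract does not specify the dataset schema or the evaluation metric.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed box convention: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundingSample:
    """Hypothetical record for one RGB-TIR grounding example."""
    rgb_path: str
    tir_path: str
    expression: str
    box: Box
    # Illustrative attribute tags, e.g. {"environment": "night"}.
    tags: Dict[str, str] = field(default_factory=dict)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(
    samples: List[GroundingSample],
    predictions: List[Box],
    threshold: float = 0.5,
    tag_filter: Optional[Dict[str, str]] = None,
) -> float:
    """Fraction of (optionally filtered) samples whose predicted box
    overlaps the ground-truth box with IoU >= threshold."""
    hits, total = 0, 0
    for sample, pred in zip(samples, predictions):
        # Skip samples that do not match every requested tag value.
        if tag_filter and any(sample.tags.get(k) != v for k, v in tag_filter.items()):
            continue
        total += 1
        if iou(sample.box, pred) >= threshold:
            hits += 1
    return hits / total if total else 0.0


samples = [
    GroundingSample("a_rgb.png", "a_tir.png", "the person near the bus",
                    (0, 0, 10, 10), {"environment": "night"}),
    GroundingSample("b_rgb.png", "b_tir.png", "the parked white car",
                    (0, 0, 10, 10), {"environment": "day"}),
]
predictions = [(0, 0, 10, 10), (20, 20, 30, 30)]

overall = accuracy_at_iou(samples, predictions)
night_only = accuracy_at_iou(samples, predictions, tag_filter={"environment": "night"})
```

Filtering by a tag such as `environment` is one way to obtain the kind of per-condition breakdown (for example, nighttime only) that the abstract refers to; the file names and expressions above are made up.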

OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios

CV and Pattern Recognition

Helps computers find things in videos using words.

21 Nov 2025 0

90%

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos

CV and Pattern Recognition

Helps robots understand what to do in a room.

3 Dec 2025 1

89%

RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models

CV and Pattern Recognition

Tests if computers can see with heat.

25 Mar 2025 1

Page Count

27 pages
