RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios
By: Tianyi Zhao, Jiawen Xi, Linhui Xiao, and more
Potential Business Impact:
Helps computers find things in pictures, even at night.
Visual Grounding (VG) aims to localize specific objects in an image according to natural language expressions, serving as a fundamental task in vision-language understanding. However, existing VG benchmarks are mostly derived from datasets collected under clean environments, such as COCO, where scene diversity is limited. Consequently, they fail to reflect the complexity of real-world conditions, such as changes in illumination and weather, that are critical for evaluating model robustness and generalization in safety-critical applications. To address these limitations, we present RGBT-Ground, the first large-scale visual grounding benchmark built for complex real-world scenarios. It consists of spatially aligned RGB and thermal infrared (TIR) image pairs with high-quality referring expressions, corresponding object bounding boxes, and fine-grained annotations at the scene, environment, and object levels. This benchmark enables comprehensive evaluation and facilitates the study of robust grounding under diverse and challenging conditions. Furthermore, we establish a unified visual grounding framework that supports both uni-modal (RGB or TIR) and multi-modal (RGB-TIR) visual inputs. Building on this framework, we propose RGBT-VGNet, a simple yet effective baseline that fuses complementary visual modalities to achieve robust grounding. We extensively adapt existing methods to RGBT-Ground, and experimental results show that RGBT-VGNet significantly outperforms these adapted methods, particularly in nighttime and long-distance scenarios. All resources will be publicly released to promote future research on robust visual grounding in complex real-world environments.
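The abstract does not specify how RGBT-VGNet fuses the two modalities, so the following is only a minimal sketch of one common approach to fusing spatially aligned RGB and TIR features: a learned per-pixel gate that lets the model lean on TIR at night and on RGB in daylight. All names here (GatedRGBTFusion, the channel size, the feature shapes) are hypothetical illustrations, not details from the paper.

    # Hypothetical sketch of RGB-TIR feature fusion for grounding; this is
    # NOT the paper's RGBT-VGNet, whose architecture the abstract does not
    # describe. Assumes per-modality backbones yield feature maps of equal shape.
    import torch
    import torch.nn as nn

    class GatedRGBTFusion(nn.Module):
        """Fuse aligned RGB and TIR feature maps via a learned per-pixel gate."""
        def __init__(self, channels: int):
            super().__init__()
            # A 1x1 conv predicts a gate in [0, 1] from the concatenated features.
            self.gate = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=1),
                nn.Sigmoid(),
            )

        def forward(self, rgb_feat: torch.Tensor, tir_feat: torch.Tensor) -> torch.Tensor:
            g = self.gate(torch.cat([rgb_feat, tir_feat], dim=1))
            # Convex combination: g -> 1 favors RGB, g -> 0 favors TIR.
            return g * rgb_feat + (1 - g) * tir_feat

    fusion = GatedRGBTFusion(channels=256)
    rgb = torch.randn(2, 256, 20, 20)   # features from an RGB backbone (hypothetical shape)
    tir = torch.randn(2, 256, 20, 20)   # features from a TIR backbone
    fused = fusion(rgb, tir)            # -> (2, 256, 20, 20)

In a full grounding pipeline, such a fused feature map would then be conditioned on the text embedding of the referring expression (for example, via cross-attention) before regressing the bounding box.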
Similar Papers
OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios
CV and Pattern Recognition
Helps computers find things in videos using words.
ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos
CV and Pattern Recognition
Helps robots understand what to do in a room.
RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models
CV and Pattern Recognition
Tests if computers can see with heat.