GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing
By: Zilun Zhang, Haozhan Shen, Tiancheng Zhao, and more
Potential Business Impact:
Helps computers understand satellite pictures better.
The application of Vision-Language Models (VLMs) in remote sensing (RS) has demonstrated significant potential in traditional tasks such as scene classification, object detection, and image captioning. However, current models, which excel in Referring Expression Comprehension (REC), struggle with tasks involving complex instructions (e.g., those containing multiple conditions) or pixel-level operations such as segmentation and change detection. In this white paper, we provide a comprehensive hierarchical summary of vision-language tasks in RS, categorized by the level of cognitive capability they require. We introduce the Remote Sensing Vision-Language Task Set (RSVLTS), which comprises Open-Vocabulary Tasks (OVT), Referring Expression Tasks (RET), and Described Object Tasks (DOT) in order of increasing difficulty, along with Visual Question Answering (VQA). Moreover, we propose a novel unified data representation using a set-of-points approach for RSVLTS, along with a condition parser and a self-augmentation strategy based on cyclic referring. These features are integrated into the GeoRSMLLM model, which is designed to handle the broad range of RSVLTS tasks, paving the way for a more generalized solution for vision-language tasks in geoscience and remote sensing.
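To make the set-of-points idea concrete, here is a minimal, hypothetical Python sketch of how heterogeneous RS outputs (boxes for REC, polygons for segmentation) could all be serialized as lists of image coordinates that a language model emits as text. The function names and the `<x,y>` token format are illustrative assumptions, not the paper's actual specification.

```python
# Hypothetical sketch: unify boxes and masks as point sets serialized to text,
# assuming pixel coordinates and a simple "<x,y>" token format.
from typing import List, Tuple

Point = Tuple[int, int]

def box_to_points(x1: int, y1: int, x2: int, y2: int) -> List[Point]:
    """Represent a detection / REC box by its two corner points."""
    return [(x1, y1), (x2, y2)]

def polygon_to_points(vertices: List[Point]) -> List[Point]:
    """Approximate a segmentation mask by its polygon vertices."""
    return list(vertices)

def points_to_text(points: List[Point]) -> str:
    """Serialize a point set into a token-friendly string."""
    return " ".join(f"<{x},{y}>" for x, y in points)

if __name__ == "__main__":
    # REC-style answer: a bounding box as two corner points.
    print(points_to_text(box_to_points(120, 48, 310, 200)))
    # Segmentation-style answer: a coarse polygon outline.
    print(points_to_text(polygon_to_points([(10, 10), (60, 12), (55, 80), (8, 75)])))
```

Under this kind of representation, tasks at different cognitive levels share a single output space, which is one plausible reading of how a unified model could cover OVT, RET, and DOT together.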
Similar Papers
Co-Training Vision Language Models for Remote Sensing Multi-task Learning
CV and Pattern Recognition
Lets computers understand many satellite picture jobs.
A Vision Centric Remote Sensing Benchmark
CV and Pattern Recognition
Helps computers understand satellite pictures better.
VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing
CV and Pattern Recognition
Maps can now understand satellite pictures and text.