MMLGNet: Cross-Modal Alignment of Remote Sensing Data using CLIP
By: Aditya Chaudhary, Sneha Barman, Mainak Singha, et al.
In this paper, we propose a novel multimodal framework, the Multimodal Language-Guided Network (MMLGNet), which aligns heterogeneous remote sensing modalities such as Hyperspectral Imaging (HSI) and LiDAR with natural language semantics using vision-language models such as CLIP. With the increasing availability of multimodal Earth observation data, there is a growing need for methods that effectively fuse spectral, spatial, and geometric information while enabling semantic-level understanding. MMLGNet employs modality-specific encoders and aligns their visual features with handcrafted textual embeddings in a shared latent space via bi-directional contrastive learning. Inspired by CLIP's training paradigm, our approach bridges the gap between high-dimensional remote sensing data and language-guided interpretation. Notably, MMLGNet achieves strong performance with simple CNN-based encoders, outperforming several established vision-only multimodal methods on two benchmark datasets and demonstrating the significant benefit of language supervision. Code is available at https://github.com/AdityaChaudhary2913/CLIP_HSI.
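
To make the alignment objective concrete, below is a minimal sketch of CLIP-style bi-directional contrastive alignment between fused remote sensing features and text embeddings, as described in the abstract. The encoder architecture, input shapes, additive fusion step, and temperature value are illustrative assumptions, not the authors' exact implementation; the text embeddings would in practice come from a frozen CLIP text encoder applied to handcrafted prompts.

# Sketch of bi-directional contrastive alignment (CLIP-style), under the
# assumptions stated above. Names such as SimpleCNNEncoder are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNNEncoder(nn.Module):
    """Hypothetical modality-specific CNN encoder (e.g., for HSI or LiDAR patches)."""
    def __init__(self, in_channels: int, embed_dim: int = 512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, x):
        return self.proj(self.features(x).flatten(1))

def bidirectional_contrastive_loss(visual_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over matched visual/text pairs, as in CLIP."""
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                  # cosine-similarity logits
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)       # visual -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)   # text -> visual direction
    return (loss_v2t + loss_t2v) / 2

# Usage sketch: encode HSI and LiDAR patches, fuse them (simple addition here,
# an assumption), and align with per-class text embeddings.
hsi_enc = SimpleCNNEncoder(in_channels=30)
lidar_enc = SimpleCNNEncoder(in_channels=1)
hsi_patch = torch.randn(8, 30, 11, 11)
lidar_patch = torch.randn(8, 1, 11, 11)
text_emb = torch.randn(8, 512)       # placeholder for frozen CLIP text features
visual_emb = hsi_enc(hsi_patch) + lidar_enc(lidar_patch)
loss = bidirectional_contrastive_loss(visual_emb, text_emb)

The symmetric loss pulls each fused visual embedding toward its matching text embedding while pushing it away from the other prompts in the batch, which is what enables the language-guided interpretation the paper targets.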