Explaining the Unseen: Multimodal Vision-Language Reasoning for Situational Awareness in Underground Mining Disasters

Published: December 9, 2025 | arXiv ID: 2512.09092v1

By: Mizanur Rahman Jewel, Mohamed Elmahallawy, Sanjay Madria and more

Underground mining disasters produce pervasive darkness, dust, and collapses that obscure vision and make situational awareness difficult for both humans and conventional systems. To address this, we propose the Multimodal Disaster Situation Explainer (MDSE), a vision-language framework that automatically generates detailed textual explanations of post-disaster underground scenes. MDSE introduces three innovations: (i) context-aware cross-attention for robust alignment of visual and textual features even under severe degradation; (ii) segmentation-aware dual-pathway visual encoding that fuses global and region-specific embeddings; and (iii) a resource-efficient Transformer-based language model for expressive caption generation at minimal compute cost. To support this task, we present the Underground Mine Disaster (UMD) dataset, the first image-caption corpus of real underground disaster scenes, which enables rigorous training and evaluation. Extensive experiments on UMD and related benchmarks show that MDSE substantially outperforms state-of-the-art captioning models, producing more accurate and contextually relevant descriptions that capture crucial details in obscured environments and thereby improving situational awareness for underground emergency response. The code is available at https://github.com/mizanJewel/Multimodal-Disaster-Situation-Explainer.
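
The abstract names two mechanisms without detailing them: a dual-pathway visual encoding that combines global and region-specific embeddings, and cross-attention between textual and visual features. The sketch below illustrates the general idea only. It is not the authors' implementation: the function names, the masked average pooling used to obtain region embeddings, the concatenation-based fusion, and the plain scaled dot-product attention are all assumptions chosen for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_embeddings(feature_map, masks):
    """Masked average pooling (assumed here): one embedding per mask.

    feature_map: (H, W, D) visual features
    masks:       (R, H, W) binary segmentation masks
    returns:     (R, D)
    """
    masks = masks.astype(feature_map.dtype)
    area = np.maximum(masks.sum(axis=(1, 2)), 1.0)  # avoid divide-by-zero
    pooled = np.einsum("rhw,hwd->rd", masks, feature_map)
    return pooled / area[:, None]

def dual_pathway_tokens(feature_map, masks):
    """Stack one global token with R region tokens (hypothetical fusion)."""
    global_emb = feature_map.mean(axis=(0, 1))[None, :]   # (1, D)
    regions = region_embeddings(feature_map, masks)       # (R, D)
    return np.concatenate([global_emb, regions], axis=0)  # (1 + R, D)

def cross_attention(text_tokens, visual_tokens, w_q, w_k, w_v):
    """Text queries attend over visual keys and values.

    Returns the attended features (T, D_v) and the attention
    weights (T, 1 + R), whose rows each sum to 1.
    """
    q = text_tokens @ w_q
    k = visual_tokens @ w_k
    v = visual_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 8, 16))       # toy feature map
    masks = rng.random(size=(3, 8, 8)) > 0.5  # three toy region masks
    text = rng.normal(size=(5, 16))           # five toy text tokens
    w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))

    visual = dual_pathway_tokens(feats, masks)
    attended, weights = cross_attention(text, visual, w_q, w_k, w_v)
    print(attended.shape, weights.shape)
```

In this toy setup each text token receives a weighted mixture of one global token and three region tokens, so the attention weights show which regions a given word draws on. A real system would learn the projection matrices and would likely use multiple heads.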

Transformer-Driven Multimodal Fusion for Explainable Suspiciousness Estimation in Visual Surveillance

CV and Pattern Recognition

Spots danger in crowds before it happens.

10 Dec 2025

87%

Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives

CV and Pattern Recognition

Helps AI remember new things without forgetting old ones.

23 Nov 2025

87%

BanglaMM-Disaster: A Multimodal Transformer-Based Deep Learning Framework for Multiclass Disaster Classification in Bangla

Machine Learning (CS)

Helps predict disasters faster using text and pictures.

26 Nov 2025
