Explaining the Unseen: Multimodal Vision-Language Reasoning for Situational Awareness in Underground Mining Disasters

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
In post-mining-disaster underground environments—characterized by darkness, high dust concentration, and structural collapse—conventional vision-based perception fails, severely degrading situational awareness. Method: This paper proposes MDSE, a multimodal vision-language framework that achieves robust image-text alignment and interpretable, fine-grained scene description generation under severe visual degradation. MDSE innovatively integrates (1) context-aware cross-modal attention, (2) segmentation-guided dual-path visual encoding, and (3) a lightweight Transformer-based language model. Contribution/Results: Evaluated on UMD—the first real-world mining-disaster image-description dataset—MDSE significantly outperforms existing vision-language models in caption accuracy, contextual relevance, and reliable identification of critical elements (e.g., obstacles, personnel status, spatial structure). It delivers trustworthy, semantics-level situational awareness for emergency response operations.

📝 Abstract
Underground mining disasters produce pervasive darkness, dust, and collapses that obscure vision and make situational awareness difficult for humans and conventional systems. To address this, we propose MDSE, the Multimodal Disaster Situation Explainer, a novel vision-language framework that automatically generates detailed textual explanations of post-disaster underground scenes. MDSE introduces three innovations: (i) Context-Aware Cross-Attention for robust alignment of visual and textual features even under severe degradation; (ii) segmentation-aware dual-pathway visual encoding that fuses global and region-specific embeddings; and (iii) a Resource-Efficient Transformer-Based Language Model for expressive caption generation at minimal compute cost. To support this task, we present the Underground Mine Disaster (UMD) dataset, the first image-caption corpus of real underground disaster scenes, enabling rigorous training and evaluation. Extensive experiments on UMD and related benchmarks show that MDSE substantially outperforms state-of-the-art captioning models, producing more accurate and contextually relevant descriptions that capture crucial details in obscured environments and improve situational awareness for underground emergency response. Code is available at https://github.com/mizanJewel/Multimodal-Disaster-Situation-Explainer.
Problem

Research questions and friction points this paper is trying to address.

Generates textual explanations for obscured underground mining disaster scenes
Improves situational awareness in dark, dusty, collapsed mining environments
Aligns visual and textual features robustly under severe degradation conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-Aware Cross-Attention aligns visual and textual features under degradation
Segmentation-aware dual pathway visual encoding fuses global and region-specific embeddings
Resource-Efficient Transformer-Based Language Model generates expressive captions with minimal compute cost
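The first two components above can be illustrated together: region embeddings from a segmentation-guided pathway are concatenated with global grid features, and text tokens attend over the fused visual tokens via scaled dot-product cross-attention. This is a minimal numpy sketch of that pattern under assumed dimensions (grid size, region count, and embedding width are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_q, visual_kv, d_k):
    # text queries attend over visual tokens (scaled dot-product attention)
    scores = text_q @ visual_kv.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)          # (n_text, n_visual)
    return weights @ visual_kv                  # attended visual context per token

rng = np.random.default_rng(0)
d = 64
global_feats = rng.standard_normal((49, d))    # global pathway: 7x7 grid features
region_feats = rng.standard_normal((8, d))     # segmentation-guided region features
visual = np.concatenate([global_feats, region_feats])  # dual-path fusion by concatenation
text = rng.standard_normal((12, d))            # embeddings of 12 caption tokens
attended = cross_attention(text, visual, d)
print(attended.shape)                          # (12, 64): one visual context per token
```

In the actual model the fusion and attention would of course be learned (projection matrices, multiple heads); this sketch only shows the data flow the bullets describe.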
Mizanur Rahman Jewel
Missouri University of Science and Technology
Mohamed Elmahallawy
Assistant Professor, Department of Computer Science and Cybersecurity at Washington State University
Machine/Federated Learning, Cybersecurity, Cryptography, Trustworthy AI
Sanjay Madria
Missouri University of Science and Technology
Samuel Frimpong
Missouri University of Science and Technology