🤖 AI Summary
In post-disaster underground mine environments, characterized by darkness, high dust concentrations, and structural collapse, conventional vision-based perception fails, severely degrading situational awareness. Method: This paper proposes MDSE, a multimodal vision-language framework that achieves robust image-text alignment and interpretable, fine-grained scene-description generation under severe visual degradation. MDSE integrates (1) context-aware cross-modal attention, (2) segmentation-guided dual-path visual encoding, and (3) a lightweight Transformer-based language model. Contribution/Results: Evaluated on UMD, the first real-world mining-disaster image-description dataset, MDSE significantly outperforms existing vision-language models in caption accuracy, contextual relevance, and reliable identification of critical elements (e.g., obstacles, personnel status, spatial structure). It delivers trustworthy, semantic-level situational awareness for emergency response operations.
📝 Abstract
Underground mining disasters produce pervasive darkness, dust, and collapses that obscure vision and make situational awareness difficult for both humans and conventional systems. To address this, we propose MDSE (Multimodal Disaster Situation Explainer), a novel vision-language framework that automatically generates detailed textual explanations of post-disaster underground scenes. MDSE introduces three innovations: (i) context-aware cross-attention for robust alignment of visual and textual features even under severe degradation; (ii) segmentation-guided dual-path visual encoding that fuses global and region-specific embeddings; and (iii) a resource-efficient Transformer-based language model for expressive caption generation at minimal compute cost. To support this task, we present the Underground Mine Disaster (UMD) dataset, the first image-caption corpus of real underground disaster scenes, enabling rigorous training and evaluation. Extensive experiments on UMD and related benchmarks show that MDSE substantially outperforms state-of-the-art captioning models, producing more accurate and contextually relevant descriptions that capture crucial details in obscured environments and improve situational awareness for underground emergency response. Code is available at https://github.com/mizanJewel/Multimodal-Disaster-Situation-Explainer.
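The abstract does not specify implementation details, but the combination of cross-attention with a dual-path (global plus region-level) visual encoding can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the authors' design: the shapes, the use of plain scaled dot-product attention, and the concatenation-based fusion of global and segmentation-derived region embeddings are all hypothetical. The idea shown is that text-token queries attend over the fused set of visual embeddings:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_q, visual_kv, d_k):
    """Text tokens (queries) attend over visual features (keys = values here)."""
    scores = text_q @ visual_kv.T / np.sqrt(d_k)   # (T, V) similarity scores
    weights = softmax(scores, axis=-1)             # attention over visual tokens
    return weights @ visual_kv                     # (T, d_k) visually grounded text features

rng = np.random.default_rng(0)
d = 8
global_feat = rng.normal(size=(1, d))    # hypothetical global scene embedding
region_feats = rng.normal(size=(4, d))   # hypothetical segmentation-derived region embeddings
# "Dual-path" fusion sketched as simple concatenation along the token axis:
visual = np.concatenate([global_feat, region_feats], axis=0)  # (5, d)

text_tokens = rng.normal(size=(3, d))    # embeddings of a partial caption
fused = cross_attention(text_tokens, visual, d)
print(fused.shape)  # (3, 8)
```

In a real model the queries, keys, and values would come from learned projections and the fusion would likely be more sophisticated, but the sketch shows how region-level cues can be made available to every generated caption token through a single attention step.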