🤖 AI Summary
In post-disaster underground mine environments, characterized by darkness, high dust concentrations, and structural collapse, conventional vision-based perception fails, severely degrading situational awareness. Method: This paper proposes MDSE, a multimodal vision-language framework that achieves robust image-text alignment and interpretable, fine-grained scene-description generation under severe visual degradation. MDSE integrates (1) context-aware cross-modal attention, (2) segmentation-guided dual-path visual encoding, and (3) a lightweight Transformer-based language model. Contribution/Results: Evaluated on UMD, the first real-world mining-disaster image-description dataset, MDSE significantly outperforms existing vision-language models in caption accuracy, contextual relevance, and reliable identification of critical elements (e.g., obstacles, personnel status, spatial structure). It delivers trustworthy, semantic-level situational awareness for emergency response operations.
📝 Abstract
Underground mining disasters produce pervasive darkness, dust, and collapses that obscure vision and make situational awareness difficult for both humans and conventional systems. To address this, we propose MDSE (Multimodal Disaster Situation Explainer), a novel vision-language framework that automatically generates detailed textual explanations of post-disaster underground scenes. MDSE introduces three innovations: (i) context-aware cross-attention for robust alignment of visual and textual features even under severe degradation; (ii) segmentation-guided dual-path visual encoding that fuses global and region-specific embeddings; and (iii) a resource-efficient Transformer-based language model for expressive caption generation at minimal compute cost. To support this task, we present the Underground Mine Disaster (UMD) dataset, the first image-caption corpus of real underground disaster scenes, enabling rigorous training and evaluation. Extensive experiments on UMD and related benchmarks show that MDSE substantially outperforms state-of-the-art captioning models, producing more accurate and contextually relevant descriptions that capture crucial details in obscured environments and improve situational awareness for underground emergency response. Code is available at https://github.com/mizanJewel/Multimodal-Disaster-Situation-Explainer.
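The abstract does not specify implementation details, but the combination of cross-attention with a dual-path (global plus region-level) visual encoding can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the authors' design: the shapes, the use of plain scaled dot-product attention, and the concatenation-based fusion of global and segmentation-derived region embeddings are all hypothetical. The idea shown is that text-token queries attend over the fused set of visual embeddings:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_q, visual_kv, d_k):
    """Text tokens (queries) attend over visual features (keys = values here)."""
    scores = text_q @ visual_kv.T / np.sqrt(d_k)   # (T, V) similarity scores
    weights = softmax(scores, axis=-1)             # attention over visual tokens
    return weights @ visual_kv                     # (T, d_k) visually grounded text features

rng = np.random.default_rng(0)
d = 8
global_feat = rng.normal(size=(1, d))    # hypothetical global scene embedding
region_feats = rng.normal(size=(4, d))   # hypothetical segmentation-derived region embeddings
# "Dual-path" fusion sketched as simple concatenation along the token axis:
visual = np.concatenate([global_feat, region_feats], axis=0)  # (5, d)

text_tokens = rng.normal(size=(3, d))    # embeddings of a partial caption
fused = cross_attention(text_tokens, visual, d)
print(fused.shape)  # (3, 8)
```

In a real model the queries, keys, and values would come from learned projections and the fusion would likely be more sophisticated, but the sketch shows how region-level cues can be made available to every generated caption token through a single attention step.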