🤖 AI Summary
Existing visual question answering (VQA) models are constrained by closed answer vocabularies, which makes them inadequate for open-ended, unseen-category natural language queries in post-disaster assessment and necessitates frequent re-annotation and task-specific fine-tuning. To address this, we propose the first zero-shot disaster VQA framework tailored to remote sensing and street-view imagery of floods and other disasters. Our method adapts frozen large-scale vision-language models (e.g., FLAVA, BLIP-2) via prompt engineering and cross-modal alignment, enabling open-ended answer generation without parameter updates. We further introduce a semantic answer mapping module and a confidence-based answer reranking mechanism to overcome the limitations of a predefined answer space. Evaluated on FloodNet, our zero-shot approach achieves 68.3% accuracy, surpassing supervised baselines by 12.7 percentage points, while supporting arbitrary novel questions and answer types. Deployment efficiency improves by over an order of magnitude compared to fine-tuning-based methods.
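The semantic answer mapping module is not detailed here; as an illustration of the general idea, the sketch below maps a free-form generated answer onto the closest label in a fixed evaluation vocabulary using bag-of-words cosine similarity. The `map_answer` function, the toy FloodNet-style label set, and the choice of similarity measure are all assumptions for illustration, not the paper's implementation (which would typically use learned embeddings rather than word counts):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def map_answer(generated: str, label_space: list[str]) -> str:
    """Map a free-form generated answer onto the most similar evaluation label."""
    gen_vec = Counter(generated.lower().split())
    scored = [(cosine(gen_vec, Counter(lbl.lower().split())), lbl)
              for lbl in label_space]
    return max(scored)[1]

# Toy label set loosely modeled on FloodNet categories (illustrative only).
labels = ["flooded", "non flooded", "flooded building", "non flooded building"]
print(map_answer("the road is flooded", labels))  # prints "flooded"
```

In a real system, the generated answer and the labels would be embedded with the same frozen text encoder, so that paraphrases (e.g., "under water" vs. "flooded") map to the same label even without lexical overlap.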
📝 Abstract
Natural disasters usually affect vast areas and devastate infrastructure. A timely and efficient response is crucial to minimize the impact on affected communities, and data-driven approaches are the best choice. Visual question answering (VQA) models help management teams achieve an in-depth understanding of damage. However, recently published models cannot answer open-ended questions and only select the best answer from a predefined list. If we want to ask questions with new possible answers that do not exist in the predefined list, the model needs to be fine-tuned/retrained on a newly collected and annotated dataset, which is a time-consuming procedure. In recent years, large-scale Vision-Language Models (VLMs) have garnered significant attention. These models are trained on extensive datasets and demonstrate strong performance on both unimodal and multimodal vision/language downstream tasks, often without the need for fine-tuning. In this paper, we propose a VLM-based zero-shot VQA (ZeShot-VQA) method and investigate its performance on the post-disaster FloodNet dataset. Since the proposed method takes advantage of zero-shot learning, it can be applied to new datasets without fine-tuning. In addition, ZeShot-VQA is able to process and generate answers that were not seen during the training procedure, which demonstrates its flexibility.
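Open-ended generation from a frozen VLM typically yields several candidate answers (e.g., beams or samples), and the summary above mentions a confidence-based reranking mechanism for choosing among them. A minimal sketch of that idea, assuming softmax-normalized beam scores and an illustrative abstention threshold (the scores, threshold, and `rerank` function are hypothetical, not the paper's exact mechanism):

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rerank(candidates: list[tuple[str, float]], min_conf: float = 0.5):
    """candidates: (answer, raw_score) pairs, e.g. from beam search.
    Returns the highest-confidence answer, or None when the model is too
    uncertain to commit to any single answer."""
    answers, scores = zip(*candidates)
    confs = softmax(list(scores))
    best_conf, best_answer = max(zip(confs, answers))
    return best_answer if best_conf >= min_conf else None

# Illustrative beam outputs with made-up scores.
beams = [("flooded", 2.1), ("non flooded", 0.3), ("water", -0.5)]
print(rerank(beams))  # prints "flooded"
```

Abstaining below a confidence threshold is one simple design choice; a deployed system might instead fall back to the semantic answer mapping step or flag the image for human review.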