AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions

📅 2026-03-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of ambiguity in real-world visual question answering (VQA), where existing systems often fail to respond appropriately. We propose the first four-level taxonomy for ambiguous VQA and introduce AQuA, a fine-grained annotated dataset that explicitly maps each ambiguity type to an optimal response strategy, such as providing a direct answer, inferring user intent, enumerating possible answers, or requesting clarification. Building on this framework, we develop a strategy-aware response generation model that fine-tunes a vision-language foundation model to automatically recognize ambiguity types and produce contextually appropriate responses. Experimental results demonstrate that our approach significantly outperforms both open-source and closed-source baselines on AQuA, marking the first step toward intelligent VQA systems capable of managing uncertainty rather than blindly generating answers.

๐Ÿ“ Abstract
Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies. Although recent studies have begun to address ambiguity in VQA, they lack (1) a systematic categorization of ambiguity levels and (2) datasets and models that support strategy-aware responses. In this paper, we introduce Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of ambiguity, along with the optimal response strategy for each case. Our evaluation of diverse open-source and proprietary VLMs shows that most models fail to adapt their strategy to the ambiguity type, frequently producing overconfident answers rather than seeking clarification or acknowledging uncertainty. To address this challenge, we fine-tune VLMs on AQuA, enabling them to adaptively choose among multiple response strategies, such as directly answering, inferring intent from contextual cues, listing plausible alternatives, or requesting clarification. VLMs trained on AQuA achieve strategic response generation for ambiguous VQA, demonstrating the ability to recognize ambiguity, manage uncertainty, and respond with context-appropriate strategies, while outperforming both open-source and closed-source baselines.
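The abstract describes four response strategies keyed to ambiguity level: answering directly, inferring intent from context, listing plausible alternatives, and requesting clarification. A minimal sketch of that strategy dispatch is below; the level names, strategy labels, and `respond` helper are illustrative assumptions for exposition, not the paper's actual taxonomy or model interface.

```python
from enum import Enum

class Ambiguity(Enum):
    """Hypothetical four-level ambiguity taxonomy (names are assumptions)."""
    NONE = 1          # clear question: answer directly
    RESOLVABLE = 2    # intent inferable from visual/textual context
    ENUMERABLE = 3    # a small set of plausible readings
    UNRESOLVABLE = 4  # cannot be resolved without asking the user

# Map each ambiguity level to the response strategy the abstract names.
STRATEGY = {
    Ambiguity.NONE: "direct_answer",
    Ambiguity.RESOLVABLE: "infer_intent",
    Ambiguity.ENUMERABLE: "list_alternatives",
    Ambiguity.UNRESOLVABLE: "request_clarification",
}

def respond(level: Ambiguity, candidates: list[str]) -> str:
    """Pick a context-appropriate response given an ambiguity level and
    candidate answers (in a real system, both come from the VLM)."""
    strategy = STRATEGY[level]
    if strategy in ("direct_answer", "infer_intent"):
        # Commit to the top candidate (directly, or after intent inference).
        return candidates[0]
    if strategy == "list_alternatives":
        # Enumerate the plausible answers instead of guessing one.
        return "Possible answers: " + ", ".join(candidates)
    # Unresolvable ambiguity: ask the user rather than answer overconfidently.
    return "Could you clarify which object or aspect you mean?"

print(respond(Ambiguity.ENUMERABLE, ["the red car", "the blue car"]))
```

In the paper's setting this routing is learned end to end by the fine-tuned VLM rather than hard-coded; the sketch only makes the level-to-strategy mapping concrete.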
Problem

Research questions and friction points this paper addresses.

Visual Question Answering
Ambiguity
Response Strategy
Vision-Language Models
Uncertainty
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ambiguous Visual Question Answering
Strategy-aware Response
Vision-Language Models
Uncertainty Management
Fine-grained Ambiguity Classification