🤖 AI Summary
Large language models (LLMs) frequently hallucinate when answering ambiguous queries because their implicit assumptions misalign with user intent. Existing approaches, such as AmbigQA and ASQA, rely on human-annotated clarifications or inherit human biases, limiting generalizability. To address this, we propose Conditional Ambiguous Question-Answering (CondAmbigQA), a novel benchmark built around the concept of *conditions*: the contextual constraints or implicit premises required to resolve ambiguity. We design a retrieval-driven, automated condition annotation pipeline over Wikipedia passages, yielding low-bias, reproducible condition labels, and we establish the first condition-aware RAG evaluation paradigm. Experiments demonstrate that integrating condition reasoning improves model performance by 20%, with an additional 5% gain when conditions are explicitly surfaced in outputs. The benchmark comprises 200 ambiguous questions paired with expert-validated condition annotations and is publicly released to advance hallucination-mitigation research.
📝 Abstract
Large language models (LLMs) are prone to hallucinations in question-answering (QA) tasks when faced with ambiguous questions. Users often assume that LLMs share their cognitive alignment, a mutual understanding of context, intent, and implicit details, and therefore omit critical information from their queries. LLMs, however, fill these gaps with their own assumptions; when those assumptions misalign with user intent, the resulting responses may be perceived as hallucinations. Identifying such implicit assumptions is therefore crucial to resolving ambiguity in QA. Prior work, such as AmbigQA, reduces query ambiguity via human-annotated clarifications, which is not feasible in real applications. Meanwhile, ASQA compiles AmbigQA's short answers into long-form responses but inherits human biases and fails to capture the explicit logical distinctions that differentiate the answers. We introduce Conditional Ambiguous Question-Answering (CondAmbigQA), a benchmark with 200 ambiguous queries and condition-aware evaluation metrics. Our study pioneers the concept of *conditions* in ambiguous QA tasks, where conditions denote the contextual constraints or assumptions that resolve ambiguities. Our retrieval-based annotation strategy uses retrieved Wikipedia fragments to identify the possible interpretations of a given query as its conditions and annotates the answers under those conditions. This strategy minimizes the human bias introduced by annotators' differing knowledge levels. By fixing the retrieval results, CondAmbigQA evaluates how RAG systems leverage conditions to resolve ambiguities. Experiments show that models that reason about conditions before answering improve performance by 20%, with an additional 5% gain when conditions are explicitly provided. These results underscore the value of conditional reasoning in QA and offer researchers tools to rigorously evaluate ambiguity resolution.
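To make the condition-aware setup concrete, the sketch below shows one plausible data layout for a CondAmbigQA-style item and a prompt builder that asks a model to enumerate conditions from fixed retrieved passages before answering. This is an illustrative assumption, not the paper's released code: the class names (`Condition`, `AmbiguousExample`) and the prompt wording are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Condition:
    """One contextual constraint (interpretation) that disambiguates a query."""
    description: str
    answer: str

@dataclass
class AmbiguousExample:
    """A CondAmbigQA-style item: one ambiguous query with condition/answer pairs."""
    query: str
    conditions: list  # list of Condition

def build_condition_prompt(example: AmbiguousExample, passages: list) -> str:
    """Assemble a condition-aware prompt over a FIXED set of retrieved passages,
    asking the model to surface each condition before answering under it."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"Passages:\n{context}\n\n"
        f"Question: {example.query}\n"
        "First list each condition (interpretation) supported by the passages, "
        "then give the answer that holds under each condition."
    )
```

Fixing the passages up front, as the benchmark does, isolates the model's condition-reasoning ability from retrieval quality, so two RAG systems can be compared on disambiguation alone.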