🤖 AI Summary
Current large language models exhibit severe deficiencies in detecting false premises during multi-hop reasoning, while mainstream benchmarks predominantly cover single-hop scenarios and thus fail to reflect real-world reasoning demands. Method: This paper introduces MultiHoax, the first benchmark specifically designed for multi-hop false-premise detection. Built on Wikipedia as its factual knowledge source, it spans ten knowledge categories across seven countries. Through expert human annotation and logic-driven question construction, MultiHoax supports multi-step reasoning-path modeling and fine-grained premise truth labeling. Contribution/Results: It is the first benchmark to evaluate premise-consistency verification across countries, knowledge categories, and multi-hop reasoning types. Empirical evaluation shows that state-of-the-art LLMs achieve an average accuracy below 40% on MultiHoax, exposing critical limitations in skeptical, premise-aware reasoning.
📝 Abstract
As Large Language Models are increasingly deployed in high-stakes domains, their ability to detect false assumptions and reason critically is crucial for ensuring reliable outputs. False-premise questions (FPQs) serve as an important evaluation method by exposing cases where flawed assumptions lead to incorrect responses. While existing benchmarks focus on single-hop FPQs, real-world reasoning often requires multi-hop inference, where models must verify consistency across multiple reasoning steps rather than rely on surface-level cues. To address this gap, we introduce MultiHoax, a benchmark for evaluating LLMs' ability to handle false premises in complex, multi-step reasoning tasks. Our dataset spans seven countries and ten diverse knowledge categories, using Wikipedia as the primary knowledge source to enable factual reasoning across regions. Experiments reveal that state-of-the-art LLMs struggle to detect false premises across different countries, knowledge categories, and multi-hop reasoning types, highlighting the need for improved false-premise detection and more robust multi-hop reasoning capabilities in LLMs.