🤖 AI Summary
Current large language models exhibit severe deficiencies in detecting false premises during multi-hop reasoning, while mainstream benchmarks predominantly cover single-hop scenarios and thus fail to reflect real-world reasoning demands. Method: This paper introduces MultiHoax, the first benchmark specifically designed for multi-hop false-premise detection. Built on Wikipedia as its factual knowledge source, it spans ten knowledge categories across seven countries. Through expert human annotation and logic-driven question construction, MultiHoax supports multi-step reasoning-path modeling and fine-grained premise truth labeling. Contribution/Results: It is the first benchmark to evaluate premise-consistency verification across countries, knowledge categories, and multi-hop reasoning types. Empirical evaluation shows that state-of-the-art LLMs achieve an average accuracy below 40% on MultiHoax, exposing critical limitations in skeptical, premise-aware reasoning.
📝 Abstract
As Large Language Models are increasingly deployed in high-stakes domains, their ability to detect false assumptions and reason critically is crucial for ensuring reliable outputs. False-premise questions (FPQs) serve as an important evaluation method by exposing cases where flawed assumptions lead to incorrect responses. While existing benchmarks focus on single-hop FPQs, real-world reasoning often requires multi-hop inference, where models must verify consistency across multiple reasoning steps rather than rely on surface-level cues. To address this gap, we introduce MultiHoax, a benchmark for evaluating LLMs' ability to handle false premises in complex, multi-step reasoning tasks. Our dataset spans seven countries and ten diverse knowledge categories, using Wikipedia as the primary knowledge source to enable factual reasoning across regions. Experiments reveal that state-of-the-art LLMs struggle to detect false premises across different countries, knowledge categories, and multi-hop reasoning types, highlighting the need for improved false-premise detection and more robust multi-hop reasoning capabilities in LLMs.