FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games

2026-06-03
Citations: 0
Influential: 0

AI Summary
This study evaluates the hypothesis-driven inductive reasoning capabilities of large language models (LLMs) in the context of scientific discovery. Inspired by the Wason 2-4-6 task, we introduce an interactive rule-discovery benchmark that requires models to propose examples, receive feedback, and iteratively uncover hidden rules, thereby simulating the core scientific reasoning processes of hypothesis generation, evidence gathering, and belief revision. For the first time, hypothesis testing and falsification behaviors are formally incorporated into the quantitative assessment of LLMs' scientific reasoning. Experiments across twelve mainstream models reveal that those with explicit reasoning mechanisms outperform purely instruction-tuned counterparts, and models actively engaging in falsification tests achieve significantly better performance. Nevertheless, overall results remain substantially below ideal levels, exposing fine-grained failure modes in hypothesis-space exploration.
Abstract
Large language models (LLMs) are increasingly deployed as autonomous agents in scientific tasks. Yet whether these systems can effectively engage in forms of inductive reasoning relevant to scientific discovery remains an open question. In this work, we introduce FALSIFYBENCH, an evaluation framework for hypothesis-driven reasoning inspired by the classic Wason 2-4-6 task, in which agents must discover hidden semantic properties by iteratively proposing examples and receiving feedback. This task captures key elements of scientific reasoning: hypothesis generation, evidence gathering, and belief revision in response to both confirming and disconfirming evidence. Our evaluation of 12 LLMs across model families and scales shows that reasoning models are generally stronger scientific reasoners than instruction-tuned models, although no model comes close to optimal performance. The primary driver of success is the capacity for negative testing: models that actively seek to falsify their hypotheses consistently outperform those that primarily seek confirmation. Moreover, a fine-grained turn-level analysis, neglected in previous work, reveals that failure is tied to identifiable patterns in how models navigate the hypothesis space.
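The contrast between confirmation-seeking and negative testing can be made concrete with a small sketch. It is not the benchmark's code: it uses the numeric rule from the original Wason 2-4-6 task rather than the hidden semantic properties FALSIFYBENCH evaluates, and the names `hidden_rule`, `hypothesis`, and `survives` are hypothetical, chosen only for illustration.

```python
from typing import Callable, Sequence

Triple = Sequence[int]
Rule = Callable[[Triple], bool]


def hidden_rule(t: Triple) -> bool:
    """Oracle (assumed for illustration): any strictly increasing triple."""
    return t[0] < t[1] < t[2]


def hypothesis(t: Triple) -> bool:
    """An over-narrow guess: each number is exactly 2 more than the last."""
    return t[1] - t[0] == 2 and t[2] - t[1] == 2


def survives(guess: Rule, oracle: Rule, probes: Sequence[Triple]) -> bool:
    """True if the guess agrees with the oracle's feedback on every probe."""
    return all(guess(p) == oracle(p) for p in probes)


# Positive tests: triples the hypothesis predicts will be accepted.
confirming_probes = [(2, 4, 6), (10, 12, 14), (1, 3, 5)]

# Negative tests: triples the hypothesis predicts will be rejected.
falsifying_probes = [(1, 2, 3), (5, 10, 20)]

# Confirming probes alone never expose the wrong hypothesis.
confirm_only = survives(hypothesis, hidden_rule, confirming_probes)

# Adding negative tests does: the oracle accepts (1, 2, 3), the guess does not.
with_negative = survives(
    hypothesis, hidden_rule, confirming_probes + falsifying_probes
)

print(confirm_only, with_negative)
```

In this toy setting the over-narrow hypothesis stays consistent with all feedback until a probe it predicts to fail is tried, which is the behavior the abstract identifies as the primary driver of success.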
Problem

Research questions and friction points this paper is trying to address.

inductive reasoning
scientific discovery
hypothesis testing
falsification
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

inductive reasoning
hypothesis falsification
scientific reasoning
rule discovery
LLM evaluation