🤖 AI Summary
This work investigates the limitations of large language models (LLMs) in inductive and abductive reasoning, particularly their difficulty generating concise, plausible hypotheses from sparse observations under incomplete world models. To address this, we introduce InAbHyD, a programmable synthetic benchmark that enables controlled generation of world models and observations. We further propose an evaluation metric grounded in Occam's Razor that jointly quantifies hypothesis simplicity and explanatory power. Systematic experiments reveal that while current LLMs exhibit basic abductive capability in simple settings, they struggle to produce high-quality, parsimonious hypotheses under complex world models; moreover, standard reasoning enhancements, including in-context learning and reinforcement learning with verifiable rewards (RLVR), yield only marginal improvements. Our contributions are threefold: (1) a new benchmark for evaluating scientific reasoning, (2) a principled metric for hypothesis quality, and (3) empirical insights into the fundamental constraints of LLMs in abductive inference.
📝 Abstract
Reasoning is a core capability of artificial intelligence systems, and large language models (LLMs) have recently shown remarkable progress in it. However, most work focuses exclusively on deductive reasoning, leaving other types of reasoning, which are equally essential for solving real-world problems, comparatively unexplored. This work evaluates LLMs' inductive and abductive reasoning capabilities. We introduce a programmable, synthetic dataset, InAbHyD (pronounced in-a-bid), in which each reasoning example consists of an incomplete world model and a set of observations. To solve an example, the intelligent agent must produce hypotheses that explain the observations under the incomplete world model. We propose a new metric, based on Occam's Razor, for evaluating the quality of hypotheses. We evaluate and analyze several state-of-the-art LLMs. Our analysis shows that LLMs can perform inductive and abductive reasoning in simple scenarios, but struggle with complex world models and with producing high-quality hypotheses, even when using popular reasoning-enhancing techniques such as in-context learning and reinforcement learning with verifiable rewards (RLVR).
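The abstract does not give the metric's exact form, but the idea of scoring hypotheses by Occam's Razor, rewarding explanatory power while penalizing hypothesis count, can be sketched in a toy form. Everything below (the data representation, the `explains` check, and the `occam_score` formula) is an illustrative assumption, not the paper's actual definition.

```python
# Hypothetical sketch: score a hypothesis set by how many observations it
# explains (coverage) relative to how many hypotheses it uses (simplicity).
# The data model and scoring rule are illustrative assumptions only.

def explains(world_model, hypotheses, observation):
    # Toy entailment check: an observation counts as explained if it appears
    # among the consequences of the hypotheses combined with the world model.
    facts = set(world_model)
    for consequences in hypotheses.values():
        facts.update(consequences)
    return observation in facts

def occam_score(world_model, hypotheses, observations):
    explained = sum(explains(world_model, hypotheses, o) for o in observations)
    coverage = explained / len(observations)   # explanatory power
    simplicity = 1.0 / (1 + len(hypotheses))   # fewer hypotheses = simpler
    return coverage * simplicity

world_model = ["birds can fly"]
observations = ["tweety can fly", "polly can fly"]
one_hypothesis = {"tweety and polly are birds": ["tweety can fly", "polly can fly"]}
two_hypotheses = {
    "tweety is a bird": ["tweety can fly"],
    "polly is a bird": ["polly can fly"],
}

# A single hypothesis explaining both observations scores higher than two
# separate hypotheses with the same coverage.
assert occam_score(world_model, one_hypothesis, observations) > \
       occam_score(world_model, two_hypotheses, observations)
```

Under this toy rule, both hypothesis sets have full coverage, so the simpler one wins; this mirrors the parsimony/explanatory-power trade-off the paper's metric is described as capturing.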