Rationality Check! Benchmarking the Rationality of Large Language Models

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the fundamental question of whether large language models (LLMs) exhibit human-like rationality. To this end, it introduces the first comprehensive evaluation benchmark covering both theoretical rationality (e.g., logical consistency) and practical rationality (e.g., preference coherence). Methodologically, it integrates principles from cognitive science and behavioral economics to design multi-domain, context-sensitive assessment tasks, and develops an open-source, extensible automated evaluation toolkit. The contributions are threefold: (1) a systematic, theory-grounded framework for assessing LLM rationality; (2) empirical evaluations across mainstream LLMs, revealing critical boundaries and cross-model disparities in rational behavior; and (3) a reproducible benchmark and analytical foundation to guide model refinement, trustworthy AI development, and rationality alignment research.
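The summary names preference coherence as one facet of practical rationality but does not reproduce the benchmark's task format. As a rough illustration of what such a check can look like, the sketch below scores a model's pairwise choices for transitivity violations; the function name, data layout, and toy answers are assumptions for illustration, not the benchmark's actual toolkit API.

```python
from itertools import permutations

def transitivity_violations(choices: dict[tuple[str, str], str]) -> list[tuple[str, str, str]]:
    """Find violations of preference transitivity in a model's pairwise choices.

    `choices[(a, b)]` holds the option the model picked when offered a vs. b.
    A triple (a, b, c) is a violation when a is preferred to b and b to c,
    yet c is preferred to a (an intransitive cycle).
    """
    def prefers(a: str, b: str) -> bool:
        # Look up the stated preference under either key order.
        if (a, b) in choices:
            return choices[(a, b)] == a
        if (b, a) in choices:
            return choices[(b, a)] == a
        return False

    items = {x for pair in choices for x in pair}
    return [
        (a, b, c)
        for a, b, c in permutations(items, 3)
        if prefers(a, b) and prefers(b, c) and prefers(c, a)
    ]

# Toy example: tea over coffee, coffee over juice, but juice over tea,
# an incoherent (intransitive) preference cycle a rational agent would avoid.
answers = {
    ("tea", "coffee"): "tea",
    ("coffee", "juice"): "coffee",
    ("juice", "tea"): "juice",
}
print(transitivity_violations(answers))
```

Each intransitive cycle appears once per rotation here (three times for the toy example); a real scorer would normalize cycles and aggregate a violation rate over many elicited choices.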

📝 Abstract
Large language models (LLMs), a recent advance in deep learning and machine intelligence, have demonstrated astonishing capabilities and are now considered among the most promising paths toward artificial general intelligence. With human-like capabilities, LLMs have been used to simulate humans and to serve as AI assistants across many applications. As a result, great concern has arisen about whether, and under what circumstances, LLMs think and behave like real human agents. Rationality is among the most important concepts in assessing human behavior, both in thinking (i.e., theoretical rationality) and in taking action (i.e., practical rationality). In this work, we propose the first benchmark for evaluating the omnibus rationality of LLMs, covering a wide range of domains and models. The benchmark includes an easy-to-use toolkit, extensive experimental results, and analysis that illuminates where LLMs converge with and diverge from idealized human rationality. We believe the benchmark can serve as a foundational tool for both developers and users of LLMs.
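The abstract mentions an easy-to-use toolkit without spelling out its interface. A harness for this kind of benchmark typically loops tasks over models and aggregates per-domain scores; the minimal sketch below assumes a simplified task schema and an exact-match scoring rule as placeholders, and `model_query` stands in for whatever LLM API is under test. None of this reflects the released toolkit's actual API.

```python
from statistics import mean

def evaluate_rationality(model_query, tasks):
    """Run rationality tasks against one model and report mean accuracy per domain.

    `model_query(prompt) -> str` wraps the LLM under test.
    Each task is a dict with "domain", "prompt", and "rational_answer" keys,
    a simplified stand-in for a real benchmark's task schema.
    """
    scores_by_domain: dict[str, list[float]] = {}
    for task in tasks:
        response = model_query(task["prompt"])
        # Exact-match scoring on a forced-choice answer; real benchmarks
        # typically use rubric- or choice-extraction scoring per task type.
        score = 1.0 if response.strip().upper() == task["rational_answer"].upper() else 0.0
        scores_by_domain.setdefault(task["domain"], []).append(score)
    return {domain: mean(vals) for domain, vals in scores_by_domain.items()}

# Example with a stubbed "model" that always answers "B".
tasks = [
    {"domain": "logical consistency",
     "prompt": "If all A are B and x is an A, is x a B? Answer A (no) or B (yes).",
     "rational_answer": "B"},
    {"domain": "preference coherence",
     "prompt": "You chose tea over coffee and coffee over juice. Would you choose tea over juice? Answer A (no) or B (yes).",
     "rational_answer": "B"},
]
print(evaluate_rationality(lambda prompt: "B", tasks))
```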
Problem

Research questions and friction points this paper is trying to address.

Benchmarking rationality of large language models
Assessing LLM convergence with human rationality
Evaluating theoretical and practical reasoning in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for evaluating LLM rationality
Toolkit covering multiple domains and models
Analysis of convergence with human rationality