🤖 AI Summary
This study addresses the absence of cognitive agency—specifically, the capacity for reflective belief construction, dynamic updating, and self-monitoring—in large language models (LLMs). To this end, we introduce Reflection-Bench, the first comprehensive benchmark explicitly designed to evaluate reflection across seven cognitive dimensions: perception, memory, belief updating, decision-making, prediction, counterfactual reasoning, and meta-reflection. We systematically assess 13 state-of-the-art LLMs. Grounded in cognitive science, our work provides the first formal definition and quantification of AI reflection capability, featuring multi-level interactive tasks that uniquely address critical gaps in belief updating and metacognitive evaluation. Experimental results reveal that even top-tier models—including GPT-4, Claude 3.5, and o1—exhibit error rates exceeding 60% on belief updating and counterfactual reasoning tasks, underscoring their fundamental lack of closed-loop cognitive regulation mechanisms.
📝 Abstract
The ability to adapt beliefs or behaviors in response to unexpected outcomes, reflection, is fundamental to intelligent systems' interaction with the world. From a cognitive science perspective, this serves as a core principle of intelligence applicable to both human and AI systems. To address the debate on the intelligence of large language models (LLMs), we propose Reflection-Bench, a comprehensive benchmark comprising 7 tasks spanning core cognitive functions crucial for reflection, including perception, memory, belief updating, decision-making, prediction, counterfactual thinking, and meta-reflection. We evaluate the performances of 13 prominent LLMs such as OpenAI o1, GPT-4, Claude 3.5 Sonnet, etc. The results indicate that current LLMs still lack satisfactory reflection ability. We discuss the underlying causes of these results and suggest potential avenues for future research. In conclusion, Reflection-Bench offers both evaluation tools and inspiration for developing AI capable of reliably interacting with the environment. Our data and code are available at https://github.com/YabYum/ReflectionBench.