🤖 AI Summary
Evaluating the higher-order cognitive abilities of large language models (LLMs), such as analysis, evaluation, and creation, has lacked a structured, theoretically grounded assessment framework. Method: We propose THiNK, a multi-agent, feedback-driven evaluation framework grounded in Bloom's Taxonomy. THiNK employs a think-aloud mechanism with three iterative stages (question generation, critical reflection, and revision) to systematically quantify LLMs' cognitive capabilities across the full spectrum from remembering to creating. Contribution/Results: THiNK integrates the hierarchical cognition theory of educational psychology into LLM evaluation, establishing a measurable and improvable structured feedback mechanism. Across seven state-of-the-art LLMs, THiNK significantly improves performance on higher-order reasoning tasks, especially evaluation and creation, and qualitative analysis confirms that its outputs align better with domain-specific logic and problem structure.
📝 Abstract
Assessing higher-order thinking skills in large language models (LLMs) remains a fundamental challenge, especially in tasks that go beyond surface-level accuracy. In this work, we propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom's Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think aloud through step-by-step reflection and refinement. This enables a systematic evaluation of both lower-order (e.g., remember, understand) and higher-order (e.g., evaluate, create) thinking skills. We apply THiNK to seven state-of-the-art LLMs and perform a detailed cognitive analysis of their outputs. Results reveal that while models perform reliably on lower-order categories, they struggle to apply knowledge in realistic contexts and exhibit limited abstraction. Structured feedback loops significantly improve reasoning performance, particularly in higher-order thinking. Qualitative evaluations further confirm that THiNK-guided outputs align better with domain logic and problem structure. Our framework provides a scalable methodology for probing and enhancing LLM reasoning, offering new directions for evaluation grounded in learning science; the code is available at our GitHub repository.
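The iterative generate-critique-revise loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the agent functions (`generate_question`, `critique`, `revise`), the scoring scale, and the stopping threshold are all hypothetical stand-ins for the multi-agent components THiNK would supply.

```python
# Hypothetical sketch of a THiNK-style feedback loop over Bloom's Taxonomy.
# All agent functions below are placeholder stubs, not the paper's actual API.

BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

def generate_question(topic):
    # Stub: in THiNK, an LLM agent would draft a problem for the topic.
    return f"Draft question on {topic}"

def critique(question):
    # Stub: critic agents would score the question against each Bloom level.
    # Fixed scores here mimic the finding that lower-order levels score well.
    return {level: (0.9 if level in ("remember", "understand") else 0.5)
            for level in BLOOM_LEVELS}

def revise(question, scores):
    # Stub: the generating agent would revise using the critics' feedback,
    # here simply noting the weakest cognitive level.
    weakest = min(scores, key=scores.get)
    return f"{question} [revised to strengthen '{weakest}']"

def think_loop(topic, max_rounds=3, threshold=0.8):
    """Iterate generate -> critique -> revise until every level passes
    the threshold or the round budget is exhausted."""
    question = generate_question(topic)
    scores = critique(question)
    for _ in range(max_rounds):
        if min(scores.values()) >= threshold:
            break
        question = revise(question, scores)
        scores = critique(question)
    return question, scores
```

With the fixed stub scores the loop always exhausts its revision budget; with real critic agents, the loop would terminate early once all six levels cleared the threshold.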