THiNK: Can Large Language Models Think-aloud?

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating the higher-order cognitive abilities of large language models (LLMs), such as analysis, evaluation, and creation, has lacked a structured, theoretically grounded assessment framework. Method: We propose THiNK, a multi-agent, feedback-driven evaluation framework grounded in Bloom's Taxonomy. THiNK uses a think-aloud mechanism with three iterative stages (problem generation, critique, and revision) to systematically quantify LLMs' full-spectrum cognitive capabilities, from remembering to creating. Contribution/Results: THiNK integrates the hierarchical cognition theory of educational psychology into LLM evaluation, establishing a measurable and improvable structured feedback mechanism. Evaluated on seven state-of-the-art LLMs, its structured feedback loops significantly improve performance on higher-order reasoning tasks, especially evaluation and creation, and qualitative analysis confirms that its outputs align better with domain-specific logic and problem structure.
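
To make the three-stage loop concrete, here is a minimal Python sketch of the generate-critique-revise cycle described above. The function names, prompts, and fixed round count are illustrative assumptions, not the authors' actual implementation; `call_llm` is a stub standing in for a real model call.

```python
# Minimal sketch of a THiNK-style think-aloud loop. All names, prompts,
# and the fixed-round stopping rule are assumptions for illustration.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; returns a canned string here."""
    return f"[model output for: {prompt[:40]}...]"

def think_aloud(seed_topic: str, max_rounds: int = 3) -> str:
    """Iterate problem generation -> critique -> revision."""
    problem = call_llm(f"Generate an assessment problem about: {seed_topic}")
    for _ in range(max_rounds):
        critique = call_llm(f"Critically reflect on this problem:\n{problem}")
        problem = call_llm(
            f"Revise the problem using this critique:\n{critique}\n\n"
            f"Problem:\n{problem}"
        )
    return problem

print(think_aloud("conditional probability"))
```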

📝 Abstract
Assessing higher-order thinking skills in large language models (LLMs) remains a fundamental challenge, especially in tasks that go beyond surface-level accuracy. In this work, we propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom's Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think aloud through step-by-step reflection and refinement. This enables a systematic evaluation of both lower-order (e.g., remember, understand) and higher-order (e.g., evaluate, create) thinking skills. We apply THiNK to seven state-of-the-art LLMs and perform a detailed cognitive analysis of their outputs. Results reveal that while models perform reliably well on lower-order categories, they struggle with applying knowledge in realistic contexts and exhibit limited abstraction. Structured feedback loops significantly improve reasoning performance, particularly in higher-order thinking. Qualitative evaluations further confirm that THiNK-guided outputs better align with domain logic and problem structure. Our framework provides a scalable methodology for probing and enhancing LLM reasoning, offering new directions for evaluation grounded in learning science; the code is available at our GitHub repository.
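
The abstract's split between lower-order and higher-order skills follows the six levels of Bloom's (revised) Taxonomy. The sketch below shows one hypothetical way to aggregate per-level scores into that split; the 0-1 scoring scale and the example numbers are placeholders, not results from the paper.

```python
# Hedged sketch: aggregating per-level scores across Bloom's six levels.
# The level names are standard revised Bloom's Taxonomy; the 0-1 scale
# and the example numbers below are placeholders, not paper results.

BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]
LOWER_ORDER = {"remember", "understand", "apply"}

def aggregate(scores: dict[str, float]) -> dict[str, float]:
    """Average per-level scores into lower-order vs. higher-order means."""
    lower = [v for k, v in scores.items() if k in LOWER_ORDER]
    higher = [v for k, v in scores.items() if k not in LOWER_ORDER]
    return {"lower_order": sum(lower) / len(lower),
            "higher_order": sum(higher) / len(higher)}

# Placeholder scores only, shaped like the trend the abstract describes
# (stronger lower-order performance, weaker higher-order performance).
example = {"remember": 0.9, "understand": 0.85, "apply": 0.6,
           "analyze": 0.55, "evaluate": 0.45, "create": 0.4}
print(aggregate(example))
```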
Problem

Research questions and friction points this paper is trying to address.

Assessing higher-order thinking skills in LLMs
Evaluating reasoning via iterative problem generation and critique
Improving LLM abstraction and knowledge application
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent feedback-driven evaluation framework (see the sketch after this list)
Iterative problem generation and refinement
Structured feedback loops improve reasoning
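
As a sketch of the first innovation bullet, the snippet below shows several critic agents pooling feedback on a generated problem before revision. The agent roles and the plain-concatenation aggregation are assumptions for illustration; the paper's actual agent design is not specified on this page.

```python
# Illustrative sketch of multi-agent feedback: several critics (roles
# are hypothetical) each review a generated problem, and their feedback
# is pooled into one block to drive the next revision step.

from dataclasses import dataclass

@dataclass
class Critic:
    role: str  # e.g., "domain expert", "pedagogy reviewer"

    def review(self, problem: str) -> str:
        # Stand-in for an LLM call prompted with this critic's role.
        return f"[{self.role} feedback on: {problem[:30]}...]"

def gather_feedback(problem: str, critics: list[Critic]) -> str:
    """Concatenate all critics' reviews into one feedback block."""
    return "\n".join(c.review(problem) for c in critics)

critics = [Critic("domain expert"), Critic("pedagogy reviewer"),
           Critic("clarity checker")]
print(gather_feedback("Design a question on eigenvalues.", critics))
```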