Judgment of Learning: A Human Ability Beyond Generative Artificial Intelligence

📅 2024-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) possess human-level judgment of learning (JOL)—a metacognitive ability to predict one’s own memory performance. Using garden-path sentences with experimentally manipulated contextual plausibility, we systematically compared JOL ratings and subsequent recall accuracy between human participants and three LLMs (GPT-3.5-turbo, GPT-4-turbo, and GPT-4o). Results show that human JOL significantly predicted recall performance (p < 0.05), whereas none of the LLMs exhibited statistically significant predictive validity (all p > 0.05). Critically, LLM JOLs remained invariant across contextual plausibility conditions and failed to align with actual recall outcomes. This work provides the first evidence of a systematic metacognitive deficit in LLMs, introducing a cross-subject predictive paradigm that demonstrates their lack of reliable self-monitoring and learning assessment capacity—implications critical for educational AI, adaptive learning systems, and trustworthy human-AI interaction.

📝 Abstract
Large language models (LLMs) increasingly mimic human cognition in various language-based tasks. However, their capacity for metacognition, particularly in predicting memory performance, remains unexplored. Here, we introduce a cross-agent prediction model to assess whether ChatGPT-based LLMs align with human judgments of learning (JOL), a metacognitive measure where individuals predict their own future memory performance. We tested humans and LLMs on pairs of sentences, one of which was a garden-path sentence, i.e., a sentence that initially misleads the reader toward an incorrect interpretation before requiring reanalysis. By manipulating contextual fit (fitting vs. unfitting sentences), we probed how intrinsic cues (i.e., relatedness) affect both LLM and human JOL. Our results revealed that while human JOL reliably predicted actual memory performance, none of the tested LLMs (GPT-3.5-turbo, GPT-4-turbo, and GPT-4o) demonstrated comparable predictive accuracy. This discrepancy emerged regardless of whether sentences appeared in fitting or unfitting contexts. These findings indicate that, despite LLMs' demonstrated capacity to model human cognition at the object level, they struggle at the meta level, failing to capture the variability in individual memory predictions. By identifying this shortcoming, our study underscores the need for further refinements in LLMs' self-monitoring abilities, which could enhance their utility in educational settings, personalized learning, and human-AI interactions. Strengthening LLMs' metacognitive performance may reduce the reliance on human oversight, paving the way for more autonomous and seamless integration of AI into tasks requiring deeper cognitive awareness.
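The abstract's central comparison, whether JOL ratings predict subsequent recall, is typically quantified in the metamemory literature with the Goodman-Kruskal gamma correlation (relative metacognitive accuracy). The sketch below is illustrative only: the function name and example data are assumptions, not taken from the paper, and the paper's own analysis may use a different statistic.

```python
def goodman_kruskal_gamma(jols, recalls):
    """Relative accuracy of memory predictions.

    jols:    per-item judgment-of-learning ratings (e.g., 0-100)
    recalls: per-item recall outcomes (1 = recalled, 0 = not recalled)
    Returns gamma in [-1, 1]; positive values mean higher JOLs
    go with better recall. Tied pairs are ignored, as is standard.
    """
    concordant = discordant = 0
    n = len(jols)
    for i in range(n):
        for j in range(i + 1, n):
            dj = jols[i] - jols[j]   # sign of the JOL difference
            dr = recalls[i] - recalls[j]  # sign of the recall difference
            if dj * dr > 0:
                concordant += 1
            elif dj * dr < 0:
                discordant += 1
    if concordant + discordant == 0:
        return 0.0  # all pairs tied: no discrimination either way
    return (concordant - discordant) / (concordant + discordant)

# JOLs that track recall perfectly yield gamma = 1.0
print(goodman_kruskal_gamma([80, 60, 30, 20], [1, 1, 0, 0]))  # → 1.0
```

Note that a flat set of JOLs (invariant ratings, as the summary reports for the LLMs across plausibility conditions) produces only tied pairs and hence gamma = 0: no predictive validity at all.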
Problem

Research questions and friction points this paper is trying to address.

AI Model Accuracy
Memory Prediction
Human-like Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

ChatGPT
Self-memory Prediction
Artificial Intelligence Autonomy
Markus Huff
Professor of Applied Cognitive Psychology, Leibniz-Institut für Wissensmedien Tübingen
Applied Cognitive Psychology · Event Cognition · Dynamic Scenes · Artificial Intelligence
Elanur Ulakçı
Leibniz-Institut für Wissensmedien, Tübingen, Germany; Eberhard Karls Universität Tübingen, Germany