CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading

📅 2026-03-12
🤖 AI Summary
This work addresses the limitations of large language models (LLMs) in high-stakes educational assessment, where overconfidence and reliability that degrades as curricula evolve make fully autonomous grading unsafe. The authors propose a human-in-the-loop short-answer scoring framework built on calibrated confidence estimation. Model confidence is calibrated with post-hoc temperature scaling, which enables confidence-based routing: high-confidence predictions are scored automatically, while low-confidence cases are deferred to human raters. A continual learning mechanism adapts the model to evolving scoring rubrics and unseen questions. Evaluated on three datasets, the approach automatically scores 35%–65% of responses at expert-level quality (quadratic weighted kappa, QWK ≥ 0.80). A QWK gap of 0.347 between accepted and rejected predictions supports the effectiveness of confidence-based routing.
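The calibrate-then-route idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the grid search used to fit the temperature, and the threshold value are all assumptions made for illustration. Logits are divided by a temperature fitted on held-out graded answers, and a prediction is automated only if its calibrated confidence clears a threshold.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature that minimizes held-out negative log-likelihood.

    Assumption: a simple grid search stands in for the gradient-based
    optimizer that is more commonly used for temperature scaling.
    """
    if grid is None:
        grid = [0.5 + 0.1 * i for i in range(96)]  # roughly 0.5 to 10.0

    def nll(temperature):
        return -sum(
            math.log(softmax_with_temperature(row, temperature)[label])
            for row, label in zip(logit_rows, labels)
        )

    return min(grid, key=nll)


def route(logits, temperature, threshold):
    """Return (predicted score, calibrated confidence, routing decision)."""
    probs = softmax_with_temperature(logits, temperature)
    confidence = max(probs)
    score = probs.index(confidence)
    # Hypothetical policy: automate at or above the threshold, else defer.
    decision = "auto" if confidence >= threshold else "human"
    return score, confidence, decision
```

With a temperature above 1, the same logits yield a lower confidence, so an overconfident prediction that would have been automated at temperature 1 can instead be deferred to a human grader.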

📝 Abstract
Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK >= 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model's grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.
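For readers unfamiliar with the metric, the sketch below computes quadratic weighted kappa (QWK) from its standard definition and then the gap between accepted and rejected predictions. The helper names and the toy scores in the usage note are hypothetical and are not taken from the paper's datasets. The degenerate-case return value is a convention chosen for this sketch only.

```python
def quadratic_weighted_kappa(human, model, num_classes):
    """QWK between two integer score lists with labels 0..num_classes-1."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if num_classes < 2:
        raise ValueError("need at least two score classes")
    n = len(human)
    # Observed confusion matrix: rows are human scores, columns model scores.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for h, m in zip(human, model):
        observed[h][m] += 1
    hist_human = [sum(row) for row in observed]
    hist_model = [
        sum(observed[i][j] for i in range(num_classes))
        for j in range(num_classes)
    ]
    numerator = 0.0
    denominator = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            numerator += weight * observed[i][j]
            # Expected count if the two raters were independent.
            denominator += weight * hist_human[i] * hist_model[j] / n
    if denominator == 0:
        # Both raters used one identical class; this sketch returns 1.0.
        return 1.0
    return 1.0 - numerator / denominator


def qwk_gap(human, model, accepted, num_classes):
    """QWK on accepted predictions minus QWK on rejected predictions."""
    kept = [(h, m) for h, m, a in zip(human, model, accepted) if a]
    deferred = [(h, m) for h, m, a in zip(human, model, accepted) if not a]
    qwk_kept = quadratic_weighted_kappa(*zip(*kept), num_classes)
    qwk_deferred = quadratic_weighted_kappa(*zip(*deferred), num_classes)
    return qwk_kept - qwk_deferred
```

As a toy usage example, `qwk_gap([0, 1, 2, 2, 0, 1, 2], [0, 1, 2, 2, 2, 1, 0], [True] * 4 + [False] * 3, 3)` compares four accepted predictions that agree perfectly with the human scores against three rejected ones that do not. A large positive gap is the pattern that confidence-based routing aims for: the model agrees with human graders far more on the answers it keeps than on the ones it defers.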
Problem

Research questions and friction points this paper is trying to address.

automated grading
uncertainty quantification
human-in-the-loop
large language models
educational assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

calibrated confidence estimation
human-in-the-loop
selective prediction
continual learning
automated grading
Authors
Pranav Raikote
Department of Computer and Systems Sciences, Stockholm University, 164 25 Kista, Sweden
Korbinian Randl
PhD student at Stockholm University
Explainable Machine Learning in NLP
Ioanna Miliou
Department of Computer and Systems Sciences, Stockholm University, 164 25 Kista, Sweden
Athanasios Lakes
Department of Computer and Systems Sciences, Stockholm University, 164 25 Kista, Sweden
Panagiotis Papapetrou
Department of Computer and Systems Sciences, Stockholm University, 164 25 Kista, Sweden