🤖 AI Summary
This study investigates the feasibility and effectiveness of multimodal large language models (MLLMs) for automated grading of large-scale handwritten calculus examinations. To address challenges in structured understanding of open-ended responses and fine-grained partial-credit assignment, we propose: (1) spatially anchored handwriting content localization; (2) a risk-aware confidence filtering mechanism grounded in the two-parameter logistic (2PL) item response theory model; and (3) a partial-credit scoring framework explicitly aligned with teaching assistant rubrics. We further introduce a human-AI collaborative filtering pipeline in which, after AI-based initial scoring, only high-risk or low-confidence predictions undergo human review. Experiments under stringent quality constraints demonstrate that the system achieves human-level scoring accuracy on the items it retains, fully automates grading for roughly 30% of items (the routine cases), and reduces manual grading effort accordingly. The work provides a reproducible methodology and empirical validation for trustworthy LLM deployment in educational assessment.
📝 Abstract
We investigate whether contemporary multimodal LLMs can assist with grading open-ended calculus at scale without eroding validity. In a large first-year exam, students' handwritten work was graded by GPT-5 against the same rubric used by teaching assistants (TAs), with fractional credit permitted; TA rubric decisions served as ground truth. We calibrated a human-in-the-loop filter that combines a partial-credit threshold with an Item Response Theory (2PL) risk measure based on the deviation between the AI score and the model-expected score for each student-item pair. Unfiltered AI-TA agreement was moderate: adequate for low-stakes feedback but not for high-stakes use. Confidence filtering made the workload-quality trade-off explicit: under stricter settings, AI delivered human-level accuracy but left roughly 70% of the items to be graded by humans. Psychometric patterns were constrained by low stakes on the open-ended portion, a small set of rubric checkpoints, and occasional misalignment between designated answer regions and where work appeared. Practical adjustments, such as slightly higher weight and protected time, a few rubric-visible substeps, and stronger spatial anchoring, should raise ceiling performance. Overall, calibrated confidence and conservative routing enable AI to reliably handle a sizable subset of routine cases while reserving expert judgment for ambiguous or pedagogically rich responses.
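
The filter described above can be illustrated with a minimal Python sketch. It uses the standard 2PL response function and treats its output as the model-expected fractional score for a student-item pair. Everything else is an assumption made for illustration and is not taken from the paper: the function names `expected_score_2pl` and `route`, the default values of `credit_threshold` and `max_risk`, and the specific way the two criteria are combined (both must pass for the AI score to be accepted).

```python
import math


def expected_score_2pl(theta: float, a: float, b: float) -> float:
    """Standard 2PL response function.

    theta: student ability, a: item discrimination, b: item difficulty.
    Read here as the model-expected fractional score in [0, 1].
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def route(
    ai_score: float,
    theta: float,
    a: float,
    b: float,
    credit_threshold: float = 0.75,  # hypothetical value, not from the paper
    max_risk: float = 0.30,          # hypothetical value, not from the paper
) -> str:
    """Return 'auto' to keep the AI score or 'human' to send it to a TA.

    Assumed combination rule: the AI score is kept only if it is at or
    above the partial-credit threshold AND its deviation from the
    2PL-expected score is small.
    """
    risk = abs(ai_score - expected_score_2pl(theta, a, b))
    if ai_score >= credit_threshold and risk <= max_risk:
        return "auto"
    return "human"


if __name__ == "__main__":
    # Strong student, full credit from the AI: low deviation, kept.
    print(route(ai_score=1.0, theta=2.0, a=1.5, b=0.0))
    # Weak student, full credit from the AI: large deviation, reviewed.
    print(route(ai_score=1.0, theta=-2.0, a=1.5, b=0.0))
    # Half credit falls below the assumed threshold, reviewed.
    print(route(ai_score=0.5, theta=0.0, a=1.0, b=0.0))
```

Tightening `credit_threshold` or `max_risk` in this sketch routes more items to humans, which mirrors the workload-quality trade-off reported in the abstract; the actual calibration procedure is not specified here.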