Artificial-Intelligence Grading Assistance for Handwritten Components of a Calculus Exam

📅 2025-10-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the feasibility and effectiveness of multimodal large language models (MLLMs) for automated grading of large-scale handwritten calculus examinations. To address challenges in structured understanding of open-ended responses and fine-grained partial-credit assignment, we propose: (1) spatially anchored handwriting content localization; (2) a risk-aware confidence filtering mechanism grounded in the two-parameter logistic (2PL) item response theory model; and (3) a partial-credit scoring framework explicitly aligned with teaching assistant rubrics. We further introduce a human-AI collaborative filtering pipeline wherein only high-risk or low-confidence predictions undergo human review after AI-based initial scoring. Experiments under stringent quality constraints demonstrate that the system achieves human-level scoring accuracy, fully automates grading for approximately 30% of routine questions, and substantially reduces manual grading effort. The work provides a reproducible methodology and empirical validation for trustworthy LLM deployment in educational assessment.
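The risk-aware confidence filter described above hinges on comparing the AI-assigned score against the score expected under a two-parameter logistic (2PL) IRT model. A minimal sketch of that deviation-based risk measure, with all function names and parameters hypothetical (the paper does not publish its implementation):

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability of a correct response
    for a student with ability theta on an item with discrimination a
    and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def irt_risk(ai_score: float, theta: float, a: float, b: float) -> float:
    """Absolute deviation between the AI-assigned score (normalized to
    [0, 1]) and the 2PL model-expected score for this student-item.
    Large deviations flag responses for human review."""
    return abs(ai_score - p_correct(theta, a, b))
```

For a student of average ability (theta equal to the item difficulty), the expected score is 0.5, so an AI grade of full credit yields a risk of 0.5 and would likely be routed to a human.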

📝 Abstract
We investigate whether contemporary multimodal LLMs can assist with grading open-ended calculus at scale without eroding validity. In a large first-year exam, students' handwritten work was graded by GPT-5 against the same rubric used by teaching assistants (TAs), with fractional credit permitted; TA rubric decisions served as ground truth. We calibrated a human-in-the-loop filter that combines a partial-credit threshold with an Item Response Theory (2PL) risk measure based on the deviation between the AI score and the model-expected score for each student-item. Unfiltered AI-TA agreement was moderate, adequate for low-stakes feedback but not for high-stakes use. Confidence filtering made the workload-quality trade-off explicit: under stricter settings, AI delivered human-level accuracy, but also left roughly 70% of the items to be graded by humans. Psychometric patterns were constrained by low stakes on the open-ended portion, a small set of rubric checkpoints, and occasional misalignment between designated answer regions and where work appeared. Practical adjustments, such as slightly higher weight and protected time, a few rubric-visible substeps, and stronger spatial anchoring, should raise ceiling performance. Overall, calibrated confidence and conservative routing enable AI to reliably handle a sizable subset of routine cases while reserving expert judgment for ambiguous or pedagogically rich responses.
Problem

Research questions and friction points this paper is trying to address.

Assisting the grading of handwritten calculus exams with multimodal AI systems
Evaluating the validity of AI-human grading agreement on open-ended responses
Developing confidence filters that route ambiguous cases to human graders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-in-the-loop filter combining a partial-credit threshold with an IRT risk measure
AI grading calibrated via deviation from the model-expected student score
Confidence filtering that routes ambiguous cases to human graders
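The routing logic above can be sketched as a simple decision rule; the thresholds and the "ambiguous band" of partial-credit scores below are illustrative assumptions, not the paper's calibrated values:

```python
def route(ai_score: float,
          risk: float,
          credit_band: tuple[float, float] = (0.2, 0.8),
          risk_cutoff: float = 0.35) -> str:
    """Auto-accept an AI grade only when the normalized partial-credit
    score falls outside the ambiguous middle band AND the IRT-based
    risk is below the cutoff; otherwise send it to a human grader."""
    lo, hi = credit_band
    ambiguous = lo < ai_score < hi
    return "human" if ambiguous or risk > risk_cutoff else "auto"
```

Tightening `credit_band` or lowering `risk_cutoff` trades automation rate for accuracy, which is the workload-quality trade-off the abstract makes explicit.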
Gerd Kortemeyer
ETH Zurich and Michigan State University
Physics Education · Online Assessment · AI in Education
Alexander Caspar
Department of Mathematics, ETH Zürich, Rämistrasse 101, Zürich, 8092, Switzerland.
Daria Horica
Rectorate and ETH AI Center, ETH Zürich, Rämistrasse 101, Zürich, 8092, Switzerland.