Assisting the Grading of a Handwritten General Chemistry Exam with Artificial Intelligence

📅 2025-09-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the validity and reliability of AI-based automated scoring for handwritten general chemistry examinations. Addressing diverse item types—including chemical equations, open-ended textual responses, numerical derivations, and hand-drawn graphs—the authors develop an end-to-end AI scoring system integrating optical character recognition (OCR), chemical formula recognition, image parsing, and psychometric evaluation, augmented by a selective human-in-the-loop scoring mechanism. Experimental results show high inter-rater agreement between AI and human graders on textual and chemical equation items (ICC > 0.90), but substantially lower agreement on numerical computation and graphical items (ICC < 0.75), underscoring the necessity of human oversight. The primary contributions are: (1) the first comprehensive AI scoring framework for handwritten chemistry exams covering all major item types; and (2) the establishment of an interpretable, auditable human-AI collaborative assessment paradigm, providing empirical evidence and methodological foundations for trustworthy deployment of intelligent educational assessment.

📝 Abstract
We explore the effectiveness and reliability of an artificial intelligence (AI)-based grading system for a handwritten general chemistry exam, comparing AI-assigned scores to human grading across various types of questions. Exam pages and grading rubrics were uploaded as images to account for chemical reaction equations, short and long open-ended answers, numerical and symbolic answer derivations, drawing, and sketching in pencil-and-paper format. Using linear regression analyses and psychometric evaluations, the investigation reveals high agreement between AI and human graders for textual and chemical reaction questions, while highlighting lower reliability for numerical and graphical tasks. The findings emphasize the necessity for human oversight to ensure grading accuracy, based on selective filtering. The results indicate promising applications for AI in routine assessment tasks, though careful consideration must be given to student perceptions of fairness and trust in integrating AI-based grading into educational practice.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI grading reliability for handwritten chemistry exams
Comparing AI and human scores across question types
Assessing grading accuracy for textual versus numerical responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI-based grading system for handwritten exams
Image uploads for chemical equations and sketches
Linear regression and psychometric evaluation comparisons
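The agreement analysis described above can be sketched as follows: compute an intraclass correlation coefficient, ICC(2,1), between AI and human scores per item type, and route low-agreement types to human graders. This is a minimal illustration, not the authors' implementation; the `icc2_1` helper, the `scores_by_type` sample data, and the `REVIEW_THRESHOLD` cutoff are hypothetical (the 0.90 value echoes the agreement level reported in the summary).

```python
import numpy as np

def icc2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `scores` has shape (n_items, n_raters), e.g. column 0 = AI, column 1 = human."""
    n, k = scores.shape
    grand = scores.mean()
    ss_rows = k * ((scores.mean(axis=1) - grand) ** 2).sum()    # between items
    ss_cols = n * ((scores.mean(axis=0) - grand) ** 2).sum()    # between raters
    ss_err = ((scores - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical AI-vs-human score pairs, grouped by item type.
scores_by_type = {
    "chemical_equation": np.array([[2, 2], [4, 4], [3, 3], [1, 1], [4, 4]], float),
    "numerical":         np.array([[2, 4], [4, 1], [3, 3], [1, 2], [4, 2]], float),
}

REVIEW_THRESHOLD = 0.90  # assumed cutoff for routing an item type to human review
for item_type, scores in scores_by_type.items():
    icc = icc2_1(scores)
    action = "accept AI score" if icc >= REVIEW_THRESHOLD else "route to human grader"
    print(f"{item_type}: ICC = {icc:.2f} -> {action}")
```

The per-type routing mirrors the selective human-in-the-loop mechanism: high-agreement item types (textual, chemical equations) can be auto-scored, while low-agreement types (numerical, graphical) stay with human graders.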
Jan Cvengros
Department of Chemistry and Applied Biosciences, ETH Zurich, HCI H 101, Vladimir-Prelog-Weg 1-5/10, 8093 Zürich, Switzerland
Gerd Kortemeyer
ETH Zurich and Michigan State University
Physics Education · Online Assessment · AI in Education