🤖 AI Summary
This study addresses the challenge of delivering timely, personalized feedback in large-scale programming courses. To this end, the authors propose a locally deployed automated grading system that integrates role-based prompt engineering with large language models (LLMs). The system validates functional correctness through unit tests and leverages LLMs to generate interpretable, pedagogically oriented feedback on code quality while maintaining transparency in its reasoning process. In a pilot deployment involving 191 students, the AI-generated scores showed no significant linear correlation with human grades (r = −0.177) but exhibited a similarly shaped distribution. Although the AI's scoring was notably more conservative (mean = 59.95 vs. 80.53 for human graders), it substantially outperformed human graders in the coverage and depth of its technical feedback.
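The two-stage design described above, unit tests for functional correctness followed by a role-conditioned LLM critique, can be sketched as follows. This is a minimal illustration, not the actual Mark My Works implementation: all names (`build_role_prompt`, `run_unit_tests`, the role personas) are hypothetical, and the LLM call itself is omitted.

```python
# Hypothetical sketch of a unit-test-plus-role-prompt grading pipeline.
# Names and personas are illustrative, not the paper's actual system.

ROLES = {
    "code_reviewer": "You are a senior code reviewer. Critique code quality, "
                     "naming, and structure.",
    "instructor": "You are a programming instructor. Give pedagogical, "
                  "actionable feedback a student can learn from.",
}

def build_role_prompt(role: str, submission: str, test_report: str) -> str:
    """Combine a role persona with the submission and its unit-test results."""
    return (
        f"{ROLES[role]}\n\n"
        f"Unit test results:\n{test_report}\n\n"
        f"Student submission:\n{submission}\n\n"
        "Explain your reasoning step by step before giving a score."
    )

def run_unit_tests(submission_fn, cases):
    """Return (pass_rate, report) for a list of (args, expected) cases."""
    lines, passed = [], 0
    for args, expected in cases:
        try:
            ok = submission_fn(*args) == expected
        except Exception:
            ok = False
        passed += ok
        lines.append(f"{args} -> expected {expected}: {'PASS' if ok else 'FAIL'}")
    return passed / len(cases), "\n".join(lines)

# Example: check a student implementation of absolute value, then build the
# prompt that would be sent to the LLM for qualitative feedback.
student_code = "def abs_val(x):\n    return x if x > 0 else -x\n"
namespace = {}
exec(student_code, namespace)
rate, report = run_unit_tests(namespace["abs_val"],
                              [((3,), 3), ((-2,), 2), ((0,), 0)])
prompt = build_role_prompt("code_reviewer", student_code, report)
print(f"pass rate: {rate:.2f}")
```

Separating objective correctness (unit tests) from subjective quality (the role-conditioned prompt) is what lets the system report a transparent reasoning trace alongside a score.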
📝 Abstract
Large programming courses struggle to provide timely, detailed feedback on student code. We developed Mark My Works, a local autograding system that combines traditional unit testing with LLM-generated explanations. The system uses role-based prompts to analyze submissions, critique code quality, and generate pedagogical feedback while maintaining transparency in its reasoning process. We piloted the system in a 191-student engineering course, comparing AI-generated assessments with human grading on 79 submissions. While AI scores showed no linear correlation with human scores (r = −0.177, p = 0.124), both systems exhibited similar left-skewed distributions, suggesting they recognize comparable quality hierarchies despite different scoring philosophies. The AI system scored more conservatively (mean 59.95 vs. 80.53 for human graders) but generated significantly more detailed technical feedback.
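The reported statistic (Pearson r over 79 paired scores) measures only linear association between the two graders. A minimal sketch of how such a coefficient is computed, using made-up score pairs rather than the study's data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative (made-up) paired scores, NOT the study's 79 submissions:
ai_scores = [55, 62, 48, 70, 58]
human_scores = [85, 78, 90, 72, 80]
r = pearson_r(ai_scores, human_scores)
print(f"r = {r:.2f}")
```

Note that a near-zero r with similarly shaped distributions, as the study observed, means the two graders agree on the overall spread of quality without ranking individual submissions the same way.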