🤖 AI Summary
Existing code comprehension assessments struggle to disentangle high-level understanding of functional intent from low-level implementation details. Method: This study proposes function naming (FN) as a novel assessment paradigm, a modification of traditional "Explain in Plain English" (EiPE) tasks, to specifically evaluate students' grasp of code functionality. It is the first to apply Item Response Theory (IRT) to FN, rigorously establishing its reliability and validity. The authors develop an open-source, scalable Python auto-grading toolkit that integrates LLM-assisted scoring with unit-test-based verification. Results: Evaluated in authentic introductory programming courses, FN achieves strong agreement with human EiPE scoring (Spearman ρ = 0.89), effectively discriminates among varying comprehension levels, and enables large-scale, objective, fine-grained assessment of code understanding.
📝 Abstract
"Explain in Plain English" (EiPE) questions are widely used to assess code comprehension skills but are challenging to grade automatically. Recent approaches like Code Generation Based Grading (CGBG) leverage large language models (LLMs) to generate code from student explanations and validate its equivalence to the original code using unit tests. However, this approach does not differentiate between high-level, purpose-focused responses and low-level, implementation-focused ones, limiting its effectiveness in assessing comprehension levels. We propose a modified approach in which students generate function names, emphasizing the function's purpose over its implementation details. We evaluate this method in an introductory programming course and analyze it using Item Response Theory (IRT) to understand its effectiveness as exam items and its alignment with traditional EiPE grading standards. We also publish this work as an open-source Python package for autograding EiPE questions, providing a scalable solution for adoption.
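The grading loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's actual package: `generate_code_from_name` is a hypothetical stand-in for the LLM call (stubbed here with a canned response so the sketch runs offline), and the reference function, test inputs, and all names are invented for the example.

```python
# Sketch of function-name (FN) grading in the CGBG style: an LLM is asked
# to write a function matching the student's proposed name, and unit tests
# check behavioral equivalence against the instructor's reference solution.

def generate_code_from_name(name: str) -> str:
    """Hypothetical LLM call: 'write a Python function that does what
    this name suggests'. Stubbed with a fixed response for illustration."""
    return (
        "def candidate(xs):\n"
        "    return sum(xs) / len(xs)\n"
    )

def reference(xs):
    # Instructor's reference solution (here: arithmetic mean).
    return sum(xs) / len(xs)

def grade(student_name: str, test_inputs) -> bool:
    """True if code regenerated from the student's function name behaves
    like the reference on every unit-test input."""
    namespace = {}
    exec(generate_code_from_name(student_name), namespace)
    candidate = namespace["candidate"]
    return all(candidate(x) == reference(x) for x in test_inputs)

print(grade("average_of_list", [[1, 2, 3], [10, 20]]))  # True with this stub
```

Because the LLM sees only the name, a vague or implementation-level name (e.g. `loop_and_divide`) is less likely to regenerate equivalent code than a purpose-level one, which is what lets unit-test pass rates proxy for comprehension level.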