🤖 AI Summary
This study addresses the automatic assessment of comprehension levels in novice programmers' responses to "Explain in Plain English" tasks, focusing on distinguishing between multi-structural (line-by-line description) and relational (holistic, purpose-level explanation) responses. We propose a zero-shot, LLM-driven segmentation method that compares the number of segments in a student's explanation against the number of segments in the code itself, using this ratio as a proxy for comprehension level without any fine-tuning. The approach segments both the source code and the natural-language explanation, and its classifications achieve substantial agreement with human expert labels. The resulting lightweight, open-source Python toolkit enables interpretable formative feedback. To our knowledge, this is the first work to systematically leverage segmentation behavior for comprehension-level classification in programming education.
📝 Abstract
Reading and understanding code are fundamental skills for novice programmers, and they are increasingly important given the growing prevalence of AI-generated code and the need to evaluate its accuracy and reliability. "Explain in Plain English" questions are a widely used approach for assessing code comprehension, but providing automated feedback, particularly on comprehension levels, remains challenging. This paper introduces a novel method for automatically assessing the comprehension level of responses to "Explain in Plain English" questions. Central to this is the ability to distinguish between two response types: multi-structural, where students describe the code line by line, and relational, where they explain the code's overall purpose. Using a Large Language Model (LLM) to segment both the student's description and the code, we determine whether the student describes each line individually (many segments) or the code as a whole (fewer segments). We evaluate the approach's effectiveness by comparing its segmentation results with human classifications, achieving substantial agreement. We conclude by discussing how this approach, which we release as an open-source Python package, could be used as a formative feedback mechanism.
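To make the segmentation-count idea concrete, here is a minimal sketch in Python. The prompt wording, model name, and classification threshold are illustrative assumptions, not the released package's actual implementation; it uses the OpenAI client purely as one example of an LLM backend.

```python
# Minimal sketch of the segment-count comparison described in the abstract.
# Prompts, model choice, and threshold are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def count_segments(text: str, kind: str) -> int:
    """Ask an LLM to split `text` into coherent segments and return the count."""
    prompt = (
        f"Split the following {kind} into coherent segments, "
        f"one segment per line. Output only the segments.\n\n{text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.strip().splitlines()
    return sum(1 for line in lines if line.strip())


def classify(explanation: str, code: str, threshold: float = 0.5) -> str:
    """Label an explanation as 'multi-structural' or 'relational'.

    An explanation with nearly as many segments as the code suggests a
    line-by-line description (multi-structural); far fewer segments
    suggest a holistic summary (relational). The threshold of 0.5 is an
    illustrative assumption.
    """
    ratio = count_segments(explanation, "explanation") / max(
        count_segments(code, "program"), 1
    )
    return "multi-structural" if ratio >= threshold else "relational"
```

A single ratio like this keeps the feedback interpretable: a student can be shown how many segments their explanation contains relative to the code, rather than an opaque model score.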