A Classroom Study of LLM-Generated Feedback Intervention in Introductory Programming

📅 2026-06-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the shortage of empirical comparisons between large language model (LLM) feedback modalities in authentic classroom settings. In a large-scale classroom study with randomized feedback assignment in an introductory Python programming course, students received natural language hints, AI-generated failing test cases, or no AI feedback on incorrect submissions, and the authors analyze the resulting submission behavior and learning trajectories. The study yields the ProgFeed dataset, comprising 6,693 submissions from 215 consenting students across 17 labs. The work is positioned as a systematic evaluation of multiple LLM-based feedback forms in a real-world educational context. Natural language feedback is significantly associated with higher completion rates and faster convergence to correct solutions, while the effect of AI-generated test cases is heterogeneous and depends critically on their validity. The results indicate that evaluating feedback quality, not merely its presence, is essential when deploying AI feedback in education.

📝 Abstract

Large language models (LLMs) are increasingly used to provide automated feedback in introductory programming courses, yet empirical evidence from authentic classroom deployments comparing different feedback modalities remains limited. In this work, we present a large-scale classroom study in which AI-generated feedback was deployed through a randomized protocol in an introductory Python programming course. Students received one of three feedback conditions on incorrect submissions: natural language hints, AI-generated failing test cases, or no AI feedback. We release the resulting dataset, ProgFeed, which captures 6,693 submissions from 215 consenting students across 17 labs, including feedback conditions, execution-based performance measures, and fine-grained temporal information. Using this data, we analyze learning trajectories, feedback quality, and submission behavior over repeated attempts. We find that natural language feedback is significantly associated with higher completion rates and faster convergence to correct solutions. Test case feedback, by contrast, exhibits heterogeneous effects that depend critically on feedback validity. Our results suggest that the form of AI-generated feedback matters, and that evaluating feedback quality -- not just its presence -- is essential for understanding its pedagogical impact.
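To make the two outcome measures in the abstract concrete, the following minimal Python sketch computes a per-condition completion rate and the mean number of attempts to the first passing submission. The record layout, the condition labels (`nl_hint`, `test_case`, `none`), the sample rows, and the assumption that one condition applies per student-lab pair are all illustrative; they are not the ProgFeed schema or the authors' analysis code.

```python
from collections import defaultdict

# Hypothetical records: (student_id, lab_id, condition, attempt_index, passed).
# Field names and values are illustrative only, not taken from ProgFeed.
submissions = [
    ("s1", "lab01", "nl_hint", 1, False),
    ("s1", "lab01", "nl_hint", 2, True),
    ("s2", "lab01", "test_case", 1, False),
    ("s2", "lab01", "test_case", 2, False),
    ("s3", "lab01", "none", 1, False),
    ("s3", "lab01", "none", 2, False),
    ("s3", "lab01", "none", 3, True),
]


def summarize(records):
    """Return completion rate and mean attempts-to-pass for each condition."""
    first_pass = {}  # (student, lab) -> attempt index of first passing submission
    condition = {}   # (student, lab) -> feedback condition (assumed fixed per pair)
    for student, lab, cond, attempt, passed in records:
        key = (student, lab)
        condition[key] = cond
        if passed and (key not in first_pass or attempt < first_pass[key]):
            first_pass[key] = attempt

    # Group each student-lab outcome under its condition; None means never solved.
    grouped = defaultdict(list)
    for key, cond in condition.items():
        grouped[cond].append(first_pass.get(key))

    summary = {}
    for cond, attempts in grouped.items():
        solved = [a for a in attempts if a is not None]
        summary[cond] = {
            "completion_rate": len(solved) / len(attempts),
            "mean_attempts_to_pass": sum(solved) / len(solved) if solved else None,
        }
    return summary


summary = summarize(submissions)
```

Here "convergence" is approximated by attempt count; the dataset's temporal information would also allow a time-based version of the same measure.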

Problem

Research questions and friction points this paper is trying to address.

LLM-generated feedback

introductory programming

classroom study

feedback modality

learning outcomes

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-generated feedback

classroom study

natural language hints