🤖 AI Summary
This study systematically evaluates the capacity of the lightweight open-source large language model Llama 3.2 (3B) to deliver formative feedback for novice Java programming learners. Addressing the need for privacy-preserving, locally deployable AI teaching assistants, we conduct qualitative content analysis grounded in an established programming education feedback assessment framework, using authentic student code submissions. Multi-dimensional manual coding and attribution analysis reveal critical limitations: while the model consistently generates syntactically correct feedback, it frequently exhibits erroneous root-cause attribution, lacks conceptual explanations, demonstrates insufficient diagnostic depth, and provides weak personalized scaffolding. Our contributions include (1) establishing the first empirically grounded benchmark for assessing feedback quality of lightweight LLMs in programming education, (2) identifying concrete pedagogical improvement pathways, and (3) providing theoretical and practical guidance for deploying resource-efficient, highly controllable AI tutors in educational settings.
📝 Abstract
Large Language Models (LLMs) have been the subject of extensive research in the past few years. This is particularly true for the potential of LLMs to generate formative programming feedback for novice learners at university. In contrast to Generative AI (GenAI) tools based on LLMs, such as GPT, smaller and open models have received much less attention. Yet, they offer several benefits, as educators can run them on a virtual machine or personal computer. This can help circumvent some major concerns applicable to other GenAI tools and LLMs (e.g., data protection, lack of control over changes, privacy). Therefore, this study explores the feedback characteristics of the open, lightweight LLM Llama 3.2 (3B). In particular, we investigate the model's responses to authentic student solutions to introductory programming exercises written in Java. The generated output is qualitatively analyzed to help evaluate the feedback's quality, content, structure, and other features. The results provide a comprehensive overview of the feedback capabilities and serious shortcomings of this open, small LLM. We further discuss the findings in the context of previous research on LLMs and contribute to benchmarking recently available GenAI tools and their feedback for novice learners of programming. Thereby, this work has implications for educators, learners, and tool developers attempting to utilize all variants of LLMs (including open and small models) to generate formative feedback and support learning.