Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Assessing the quality of instructional feedback that large language models (LLMs) generate for scientific inquiry experiment design, particularly relative to human practitioners, remains underexplored in authentic science education contexts. Method: A blinded comparative evaluation of feedback from an LLM agent, teachers, and science education experts on student-written experimentation protocols, based on a six-dimensional framework (Feed Up, Feed Back, Feed Forward, constructive tone, linguistic clarity, technical terminology); four domain experts independently rated the feedback texts on a 5-point Likert scale. Contribution/Results: LLM-generated feedback was statistically indistinguishable from human feedback in overall quality, demonstrating strong potential for efficient initial screening. However, the LLM agent significantly underperformed on the "Feed Back" dimension, i.e., identifying and explaining errors within the context of the student's work, revealing limitations in diagnostic depth. The study presents a systematic, context-embedded comparison of LLM versus human feedback quality in science education, proposes a human-AI collaborative feedback paradigm, and provides empirical grounding and actionable pathways for AI integration in science pedagogy.

📝 Abstract
Effective feedback is essential for fostering students' success in scientific inquiry. With advancements in artificial intelligence, large language models (LLMs) offer new possibilities for delivering instant and adaptive feedback. However, this feedback often lacks the pedagogical validation provided by real-world practitioners. To address this limitation, our study evaluates and compares the feedback quality of LLM agents with that of human teachers and science education experts on student-written experimentation protocols. Four blinded raters, all professionals in scientific inquiry and science education, evaluated the feedback texts generated by 1) the LLM agent, 2) the teachers, and 3) the science education experts using a five-point Likert scale based on six criteria of effective feedback: Feed Up, Feed Back, Feed Forward, Constructive Tone, Linguistic Clarity, and Technical Terminology. Our results indicate that LLM-generated feedback shows no significant difference from that of teachers and experts in overall quality. However, the LLM agent's performance lags in the Feed Back dimension, which involves identifying and explaining errors within the student's work context. Qualitative analysis highlighted the LLM agent's limitations in contextual understanding and in the clear communication of specific errors. Our findings suggest that combining LLM-generated feedback with human expertise can enhance educational practices by leveraging the efficiency of LLMs and the nuanced understanding of educators.
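The comparison described in the abstract amounts to contrasting 5-point Likert rating distributions per dimension across feedback sources. The paper does not specify its statistical test, so the sketch below uses a simple two-sided permutation test on the difference of means; all ratings and group names are hypothetical, for illustration only.

```python
import random
from statistics import mean

# Hypothetical 5-point Likert ratings (four raters per dimension);
# illustrative values only -- not the paper's actual data.
ratings = {
    "LLM agent": {"Feed Up": [4, 5, 4, 4], "Feed Back": [3, 3, 4, 3]},
    "Teachers":  {"Feed Up": [4, 4, 5, 4], "Feed Back": [4, 5, 4, 4]},
}

def permutation_test(a, b, n_iter=10_000, seed=0):
    """Two-sided permutation test on the absolute difference of means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_iter  # proportion of shuffles at least as extreme

for dim in ["Feed Up", "Feed Back"]:
    p = permutation_test(ratings["LLM agent"][dim], ratings["Teachers"][dim])
    print(f"{dim}: p = {p:.3f}")
```

With only four ratings per cell, a nonparametric resampling test like this (or a rank-based test such as Kruskal-Wallis across the three sources) is a common choice, since ordinal Likert data rarely justify a t-test's normality assumption.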
Problem

Research questions and friction points this paper is trying to address.

Compare the feedback quality of LLMs with that of teachers and science education experts
Evaluate the effectiveness of AI-generated feedback in educational practice
Identify the limitations of LLMs in contextual understanding and error communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

An LLM agent delivers instant, adaptive feedback on student experimentation protocols
Blinded expert rating of feedback quality across six criteria of effective feedback
Combining LLM-generated feedback with human expertise to enhance educational practice