AI Summary
This study addresses the lack of systematic comparison between large language models (LLMs) and human experts on key dimensions of argumentative essay feedback, namely goal orientation, anchoring behavior, and priority assignment. The authors introduce the FOXGLOVE dataset, comprising 696 expert-written and 1,644 LLM-generated feedback comments, enabling a large-scale comparative analysis of four state-of-the-art models and human experts under a unified protocol. A multidimensional quality assessment framework integrates human annotation, expert scoring, and natural language processing techniques. Results indicate that while LLM-generated feedback scores higher on most quality dimensions, this advantage is largely attributable to greater verbosity. Although humans and models distribute feedback similarly across goals and essay positions, they diverge significantly in the specific sentences they anchor feedback to, revealing fundamental differences in their underlying feedback mechanisms.
Abstract
While large language models (LLMs) are increasingly used to generate writing feedback, there remains no systematic comparison of LLM and expert feedback on the dimensions that writing research identifies as central to revision: goal orientation, anchoring to specific sentences, and prioritization. We introduce FOXGLOVE, a dataset of 696 feedback comments written by trained writing instructors on 69 twelfth-grade argumentative essays, paired with 1,644 comments generated by four frontier LLMs under a shared protocol, totaling 2,340 comments. We provide expert quality ratings on a subset of both instructor and LLM comments. We find that instructors and LLMs distribute feedback similarly across goals and essay positions, yet diverge on the specific sentences on which they provide feedback. Additionally, we find that models tend to write more complex feedback and use fewer questions than instructors. LLM feedback also receives higher ratings from instructors on most dimensions of quality, but much of this advantage appears to be attributable to lengthier comments. FOXGLOVE enables systematic comparison of where human and LLM feedback align and where they diverge.