LLMs Do Not Grade Essays Like Humans

📅 2026-03-24
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the unclear alignment between large language models (LLMs) and human raters in automated essay scoring. It presents the first systematic evaluation of off-the-shelf GPT and Llama family models in a zero-shot setting, combining correlation analysis with error pattern identification. The findings reveal that these models rely on different signals than human graders: they tend to overrate short essays and underrate longer ones containing minor grammatical errors. Although model scores show only weak agreement with human ratings, they exhibit strong internal consistency with the qualitative feedback the models generate, suggesting potential utility as assistive grading tools. This work provides critical empirical evidence for understanding the scoring mechanisms of LLMs and their implications for educational applications.


๐Ÿ“ Abstract
Large language models have recently been proposed as tools for automated essay scoring, but their agreement with human grading remains unclear. In this work, we evaluate how LLM-generated scores compare with human grades and analyze the grading behavior of several models from the GPT and Llama families in an out-of-the-box setting, without task-specific training. Our results show that agreement between LLM and human scores remains relatively weak and varies with essay characteristics. In particular, compared to human raters, LLMs tend to assign higher scores to short or underdeveloped essays, while assigning lower scores to longer essays that contain minor grammatical or spelling errors. We also find that the scores generated by LLMs are generally consistent with the feedback they generate: essays receiving more praise tend to receive higher scores, while essays receiving more criticism tend to receive lower scores. These results suggest that LLM-generated scores and feedback follow coherent patterns but rely on signals that differ from those used by human raters, resulting in limited alignment with human grading practices. Nevertheless, our work shows that LLMs produce feedback that is consistent with their grading and that they can reliably be used to support essay scoring.
Problem

Research questions and friction points this paper is trying to address.

automated essay scoring
large language models
human grading alignment
essay evaluation
LLM-human disagreement
Innovation

Methods, ideas, or system contributions that make the work stand out.

automated essay scoring
large language models
human-LLM alignment
grading behavior
feedback consistency
Jerin George Mathew
University of Alberta, 116 St & 85 Ave, Edmonton, AB, T6G 2R3, Canada
Sumayya Taher
University of Alberta, 116 St & 85 Ave, Edmonton, AB, T6G 2R3, Canada
Anindita Kundu
Associate Professor, Vellore Institute of Technology
Internet of Things, Robotic Path Planning and Opportunistic Networks
Denilson Barbosa
Professor, Computing Science, University of Alberta
Information Extraction, Natural Language Processing, Knowledge Graphs