Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring

📅 2024-07-04
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This study investigates the implicit scoring logic of large language models (LLMs) in automated grading of scientific open-ended questions and its alignment with human scoring criteria. We identify a critical bias: LLMs over-rely on superficial linguistic features rather than deep domain-specific reasoning. To address this, we propose *rubric-aware prompting*—a novel method that explicitly integrates human-designed analytical scoring rubrics into the prompting process. Through systematic experiments across multiple LLMs, temperature settings, and context configurations—complemented by semantic alignment analysis—we uncover a substantial rubric gap between LLM and human graders. Incorporating high-quality, human-curated rubrics improves LLM grading accuracy by an average of 12.3% and achieves a Cohen’s Kappa of 0.81, approaching expert inter-rater reliability. Our work establishes an interpretable, transferable methodological framework to enhance the validity and trustworthiness of LLM-based educational assessment.
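
To make the proposed method concrete, below is a minimal sketch of rubric-aware prompting in Python, where a human-designed analytic rubric is embedded verbatim in the grading prompt. The prompt wording, the fixed temperature, and the `call_llm` helper are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of rubric-aware prompting, assuming the human rubric is
# pasted verbatim into the grading prompt. The prompt wording, temperature,
# and `call_llm` helper are illustrative placeholders, not the paper's setup.

def call_llm(prompt: str, temperature: float = 0.0) -> str:
    """Placeholder for a chat-completion call; swap in an actual LLM client."""
    raise NotImplementedError("plug in your LLM provider here")

def build_scoring_prompt(task: str, rubric: str, response: str) -> str:
    """Embed a human-designed analytic rubric directly in the grading prompt."""
    return (
        "You are grading a student's written response to a science task.\n\n"
        f"Task:\n{task}\n\n"
        f"Score strictly according to this analytic rubric:\n{rubric}\n\n"
        f"Student response:\n{response}\n\n"
        "Return the total score and a one-sentence justification per rubric criterion."
    )

def score_response(task: str, rubric: str, response: str) -> str:
    """Ask the LLM for a rubric-grounded score at a fixed, low temperature."""
    return call_llm(build_scoring_prompt(task, rubric, response), temperature=0.0)
```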

📝 Abstract
Large language models (LLMs) have demonstrated strong potential in performing automatic scoring for constructed response assessments. While constructed responses graded by humans are usually based on given grading rubrics, the methods by which LLMs assign scores remain largely unclear. It is also uncertain how closely AI's scoring process mirrors that of humans, or whether it adheres to the same grading criteria. To address this gap, this paper uncovers the grading rubrics that LLMs use to score students' written responses to science tasks and examines their alignment with human scores. We also examine whether enhancing this alignment can improve scoring accuracy. Specifically, we prompt LLMs to generate the analytic rubrics they use to assign scores and study the alignment gap with human grading rubrics. Based on a series of experiments with various configurations of LLM settings, we reveal a notable alignment gap between human and LLM graders. While LLMs can adapt quickly to scoring tasks, they often resort to shortcuts, bypassing the deeper logical reasoning expected in human grading. We find that incorporating high-quality analytical rubrics designed to reflect human grading logic can mitigate this gap and enhance LLMs' scoring accuracy. These results underscore the need for a nuanced approach when applying LLMs in science education and highlight the importance of aligning LLM outputs with human expectations to ensure efficient and accurate automatic scoring.
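
Agreement between LLM and human scores of the kind reported above (Cohen's Kappa) can be computed directly from paired score lists. The sketch below uses scikit-learn's `cohen_kappa_score` on invented example scores; whether the paper reports unweighted or weighted kappa is an assumption.

```python
# Sketch of quantifying LLM-human scoring agreement with Cohen's kappa.
# The score lists below are made-up examples, not data from the paper.
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 1, 3, 0, 2, 3, 1, 2]  # expert-assigned scores per response
llm_scores   = [2, 1, 2, 0, 2, 3, 1, 1]  # LLM-assigned scores for the same responses

# Unweighted kappa treats every disagreement equally; quadratic weighting
# penalizes larger score gaps more, which suits ordinal scoring rubrics.
print("kappa:", cohen_kappa_score(human_scores, llm_scores))
print("quadratic-weighted kappa:",
      cohen_kappa_score(human_scores, llm_scores, weights="quadratic"))
```
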
Problem

Research questions and friction points this paper is trying to address.

Explore how LLM scoring processes differ from human grading.
Analyze how closely LLM-assigned scores and rubrics align with human graders.
Improve LLM scoring accuracy by incorporating human-designed rubrics.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt LLMs to generate the analytic rubrics they use to score
Align LLM scoring logic with human grading rubrics
Enhance scoring accuracy via high-quality human rubrics