Exploring the Utilities of the Rationales from Large Language Models to Enhance Automated Essay Scoring

📅 2025-10-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether reasoning texts generated by large language models (LLMs, specifically GPT-4.1 and GPT-5) can enhance the validity of automated essay scoring (AES) under class imbalance, particularly when zero-score essays are scarce. We propose three modeling paradigms: (1) a text-only baseline, (2) LLM-generated explanatory reasoning, and (3) an ensemble integrating the original essays with both rationale sources. Structured reasoning is elicited via prompt engineering, and model performance is evaluated using quadratic weighted kappa (QWK) and F1 score. Results show that while single-modality rationale-based scoring yields marginally lower overall QWK, it substantially improves F1 for rare classes (e.g., zero scores). The fused model achieves QWK = 0.870, surpassing the prior state of the art (0.848), demonstrating that LLM-generated explanations enhance scoring consistency and long-tail class discrimination. To our knowledge, this is the first systematic empirical study validating LLM-generated explanations for mitigating class bias and improving robustness in AES.

📝 Abstract
This study explored the utilities of rationales generated by GPT-4.1 and GPT-5 in automated scoring, using Prompt 6 essays from the 2012 Kaggle ASAP data. Essay-based scoring was compared with rationale-based scoring. The study found that, in general, essay-based scoring performed better than rationale-based scoring, with higher Quadratic Weighted Kappa (QWK). However, rationale-based scoring achieved higher accuracy in terms of F1 scores for score 0, which was underrepresented due to class imbalance. Ensemble modeling of the essay-based scoring models increased accuracy both at specific score levels and across all score levels. Ensembles of essay-based scoring with each of the rationale-based scoring models performed about the same. A further ensemble of essay-based scoring with both rationale-based scoring models yielded the best accuracy, with a QWK of 0.870 compared with the 0.848 reported in the literature.
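QWK, the agreement metric reported throughout the abstract, penalizes disagreements between human and machine scores by the square of their distance. A minimal self-contained sketch of the computation follows; the score vectors in the example are illustrative, not taken from the ASAP Prompt 6 data.

```python
def quadratic_weighted_kappa(y_true, y_pred, k):
    """QWK between two integer score vectors with k possible scores (0..k-1)."""
    n = len(y_true)
    # Observed agreement matrix: O[i][j] counts essays rated i by humans, j by the model.
    O = [[0.0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        O[t][p] += 1
    # Marginal score histograms for each rater.
    hist_true = [sum(O[i][j] for j in range(k)) for i in range(k)]
    hist_pred = [sum(O[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2       # quadratic disagreement weight
            E = hist_true[i] * hist_pred[j] / n   # expected count under chance
            num += w * O[i][j]
            den += w * E
    return 1.0 - num / den

human = [0, 1, 2, 3, 4, 4, 2, 1]  # hypothetical human scores
model = [0, 1, 2, 3, 4, 3, 2, 2]  # hypothetical model scores
print(round(quadratic_weighted_kappa(human, model, 5), 3))  # → 0.922
```

Perfect agreement yields QWK = 1.0; adjacent-score disagreements are penalized far less than distant ones, which is why QWK is the standard metric for ordinal essay scores. In practice, `sklearn.metrics.cohen_kappa_score` with `weights="quadratic"` computes the same quantity.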
Problem

Research questions and friction points this paper is trying to address.

Comparing essay-based and rationale-based automated essay scoring methods
Investigating class imbalance impact on scoring accuracy using LLM rationales
Developing ensemble models combining multiple scoring approaches for improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using ensemble modeling to combine essay-based and rationale-based scoring models
Leveraging rationales from large language models for scoring
Improving scoring accuracy with hybrid ensemble methods
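One common form of the ensembling idea above is to average per-class probabilities from the essay-based scorer and each rationale-based scorer, then predict the highest-probability score. The sketch below is a hypothetical illustration of that scheme; the paper's actual ensembling procedure may differ, and the probability vectors are invented.

```python
def ensemble_predict(prob_lists):
    """Average class-probability vectors from several models and return the argmax score."""
    k = len(prob_lists[0])
    avg = [sum(p[i] for p in prob_lists) / len(prob_lists) for i in range(k)]
    return max(range(k), key=lambda i: avg[i])

# Hypothetical per-score probabilities (scores 0..3) from three component models.
essay_probs      = [0.10, 0.20, 0.50, 0.20]  # essay-based model
rationale1_probs = [0.30, 0.40, 0.20, 0.10]  # GPT-4.1 rationale model
rationale2_probs = [0.25, 0.30, 0.30, 0.15]  # GPT-5 rationale model

print(ensemble_predict([essay_probs, rationale1_probs, rationale2_probs]))  # → 2
```

Averaging lets a rationale-based model that is better calibrated on rare scores (such as score 0) pull the ensemble toward those classes even when the essay-based model alone would miss them, which is consistent with the F1 gains the abstract reports.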
Hong Jiao
University of Maryland, College Park
educational measurement, psychometrics
Hanna Choi
University of Maryland, College Park
Haowei Hua
Princeton University