🤖 AI Summary
Encoder-based models (e.g., BERT) face a critical limitation in automated essay scoring (AES) due to their 512-token input constraint, resulting in inadequate comprehension and inconsistent scoring for long essays. To address this, we propose a novel LLM-based scoring framework that integrates text summarization with structured prompt engineering within a two-stage “summarize-then-score” paradigm. This design enables effective processing of lengthy inputs while generating interpretable, rationale-backed scores. Evaluated on the Learning Agency Lab AES 2.0 benchmark, our approach achieves a quadratic weighted kappa (QWK) of 0.8878, an absolute improvement of 0.066 over the BERT baseline (0.822). The improvement demonstrates both scalability beyond fixed-length encoders and enhanced inter-rater consistency. Our work establishes a new, extensible pathway for high-fidelity AES of long-form compositions.
📝 Abstract
BERT and its variants have been extensively explored for automated scoring. However, the 512-token input limit of these encoder-based models makes them deficient at scoring long essays. This research therefore explores generative language models for automated scoring of long essays via summarization and prompting. The results show a substantial gain in scoring accuracy, with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
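The two-stage "summarize-then-score" paradigm described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `call_llm` is a hypothetical stand-in for any generative-model API (stubbed here with canned replies so the example runs offline), and the rubric wording, score range, and output format are assumptions for demonstration.

```python
import re

def call_llm(prompt: str) -> str:
    # Stub standing in for a real generative-model API call.
    # Canned replies let the sketch run without network access.
    if "Summarize" in prompt:
        return "The essay argues that school uniforms improve student focus."
    return "Score: 4\nRationale: Clear thesis and organization, limited evidence."

def summarize(essay: str) -> str:
    """Stage 1: compress a long essay so it fits within the scoring prompt."""
    return call_llm(f"Summarize the following essay in a few sentences:\n{essay}")

def score(summary: str) -> tuple[int, str]:
    """Stage 2: a structured prompt asks for a holistic score plus a rationale."""
    reply = call_llm(
        "You are an essay rater. Using the rubric, give a holistic score "
        "from 1 to 6 and a brief rationale.\n"
        f"Essay summary:\n{summary}\n"
        "Answer as:\nScore: <n>\nRationale: <text>"
    )
    match = re.search(r"Score:\s*(\d)", reply)
    rationale = reply.split("Rationale:", 1)[-1].strip()
    return int(match.group(1)), rationale

# A long essay (well beyond 512 tokens) would go here.
essay_summary = summarize("...")
points, why = score(essay_summary)
print(points, why)
```

The key design point is that only the summary, not the full essay, reaches the scoring prompt, which sidesteps fixed-length encoder limits while the rationale keeps each score interpretable.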