Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays

📅 2025-10-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Encoder-based models such as BERT face a critical limitation in automated essay scoring (AES): their 512-token input constraint leads to truncated comprehension and inconsistent scoring of long essays. To address this, the paper proposes an LLM-based scoring framework that combines text summarization with structured prompt engineering in a two-stage "summarize-then-score" paradigm. This design enables effective processing of lengthy inputs while producing interpretable, rationale-backed scores. Evaluated on the Learning Agency Lab Automated Essay Scoring 2.0 benchmark, the approach achieves a quadratic weighted kappa (QWK) of 0.8878, an improvement of 0.066 over the BERT baseline (0.822). The result demonstrates both scalability beyond fixed-length encoders and improved agreement with human raters, establishing an extensible pathway for high-fidelity AES of long-form compositions.
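Since the headline comparison (0.822 vs. 0.8878) rests on quadratic weighted kappa, a minimal sketch of the metric may help. This is the standard QWK definition in pure Python, not code from the paper: agreement is weighted so that disagreements between distant score points are penalized quadratically.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic weighted kappa between two lists of integer ratings."""
    k = max_rating - min_rating + 1
    n = len(rater_a)

    # Observed agreement matrix O[i][j]: count of (a=i, b=j) pairs.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal rating histograms for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    numerator = 0.0
    denominator = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic penalty
            expected = hist_a[i] * hist_b[j] / n   # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator
```

Perfect agreement yields 1.0, chance-level agreement 0.0; `sklearn.metrics.cohen_kappa_score` with `weights="quadratic"` computes the same quantity.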

📝 Abstract
BERT and its variants have been extensively explored for automated scoring. However, the 512-token limit of these encoder-based models is a deficiency for automated scoring of long essays. This research therefore explores generative language models for automated scoring of long essays via summarization and prompting. The results reveal a substantial improvement in scoring accuracy, with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
Problem

Research questions and friction points this paper is trying to address.

Automated scoring of long essays using generative language models
Overcoming token limitations in encoder-based scoring models
Improving scoring accuracy through summarization and prompting techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using generative models for essay scoring
Summarizing long essays via prompting
Improving scoring accuracy as measured by QWK
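The two-stage paradigm listed above can be sketched as a small pipeline. This is an assumption-laden illustration, not the paper's actual prompts or model: `llm` stands in for any text-in/text-out generative model call, and the rubric wording and `Score: <n>` reply format are hypothetical.

```python
import re
from typing import Callable, Tuple

def summarize_then_score(essay: str, llm: Callable[[str], str],
                         max_words: int = 200) -> Tuple[int, str]:
    """Two-stage 'summarize-then-score' pipeline (illustrative sketch)."""
    # Stage 1: compress the long essay below the scorer's context budget.
    summary = llm(
        f"Summarize the following essay in at most {max_words} words, "
        f"preserving its thesis, structure, and evidence:\n\n{essay}"
    )
    # Stage 2: score the summary against a rubric, asking for a rationale.
    verdict = llm(
        "Score the essay summarized below on a 1-6 holistic rubric. "
        "Reply as 'Score: <n>' followed by a one-paragraph rationale:\n\n"
        + summary
    )
    match = re.search(r"Score:\s*(\d)", verdict)
    if match is None:
        raise ValueError("scorer reply did not contain a parseable score")
    return int(match.group(1)), verdict
```

Keeping the score and the free-text verdict together is what makes the output rationale-backed rather than a bare number.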
Haowei Hua
Princeton University
Hong Jiao
University of Maryland, College Park
educational measurement, psychometrics
Xinyi Wang
University of Maryland, College Park & Beijing Normal University