The Order Matters: Sequential Fine-Tuning of LLaMA for Coherent Automated Essay Scoring

📅 2026-06-08

🤖 AI Summary

This study addresses a limitation of existing automated essay scoring systems: they do not model dependencies among discourse elements (such as leads, claims, evidence, and conclusions), which undermines coherence and generalization. To overcome this, the authors propose a task-aware sequential fine-tuning strategy that applies LoRA with 4-bit quantization to LLaMA-3.1-8B, training the model on discourse components in their natural structural order. Evaluated on the PERSUADE 2.0 dataset, this approach yields the strongest overall results among the three curricula compared (sequential, independent, and randomized multi-task), reaching F1 scores of 65% on evidence and 87% on conclusion, with accuracies of 63% and 85%, and outperforming a LLaMA-70B baseline on conclusion scoring. Randomized training does better on position scoring (57% F1) but is less consistent elsewhere. These results suggest that structure-aligned, lightweight fine-tuning can be both effective and cost-efficient for educational assessment tasks.

📝 Abstract

Automated Essay Scoring (AES) systems must judge interdependent discourse elements (e.g., lead, claim, evidence, conclusion), yet most approaches treat these in isolation, harming coherence and generalization. We investigate task-aware fine-tuning of LLaMA-3.1-8B for AES using parameter-efficient LoRA with 4-bit quantization and compare three training curricula: (i) Sequential (progressively fine-tuning on lead, then position, then claim, then evidence, then conclusion), (ii) Independent (task-specific models), and (iii) Randomized (shuffled multi-task). Experiments on the PERSUADE 2.0 corpus show that modeling task dependencies matters: Sequential fine-tuning yields the strongest overall results, including F1 scores of 65% (evidence) and 87% (conclusion) and corresponding accuracies of 63% and 85%, surpassing Independent training and outperforming a general-purpose LLaMA-70B baseline on conclusion despite its far larger capacity. Randomized training improves position scoring (57% F1) but is less consistent elsewhere. These findings indicate that (1) curriculum design aligned with discourse structure can materially improve AES, and (2) small, task-optimized models can be competitive with substantially larger Large Language Models (LLMs), offering a practical path to scalable, cost-effective assessment. We release templates and implementation details to facilitate reproduction and future work on curriculum design for educational NLP.
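
The three curricula compared in the abstract differ only in how training examples are grouped and ordered. The sketch below is one possible way to express that difference and is not the authors' implementation: the function name `build_curriculum`, the string stand-ins for examples, and the choice to model Sequential training as a single run that visits the tasks stage by stage are all assumptions made for illustration. The LoRA and 4-bit quantization setup is deliberately left out.

```python
import random
from typing import Dict, List

# Discourse elements in the order listed in the abstract.
TASKS: List[str] = ["lead", "position", "claim", "evidence", "conclusion"]


def build_curriculum(
    examples: Dict[str, List[str]], mode: str, seed: int = 0
) -> List[List[str]]:
    """Return a list of training runs (hypothetical helper).

    Each run is the ordered list of examples seen by one model.
    `examples` maps a task name to that task's training examples.
    """
    if mode == "sequential":
        # One model; tasks are visited one stage at a time in discourse order.
        return [[ex for task in TASKS for ex in examples[task]]]
    if mode == "independent":
        # One separate model (run) per task.
        return [list(examples[task]) for task in TASKS]
    if mode == "randomized":
        # One model; all tasks are pooled and shuffled together.
        pooled = [ex for task in TASKS for ex in examples[task]]
        random.Random(seed).shuffle(pooled)
        return [pooled]
    raise ValueError(f"unknown curriculum mode: {mode!r}")
```

Calling `build_curriculum(data, "sequential")` returns a single run whose first examples are all lead examples and whose last are all conclusion examples, while `"independent"` returns five runs, one per task. In an actual fine-tuning loop, each run would correspond to one adapter trained on that run's examples in order.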

Problem

Research questions and friction points this paper is trying to address.

Automated Essay Scoring

discourse coherence

task dependencies

curriculum design

Large Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential Fine-Tuning

Automated Essay Scoring

Curriculum Learning