Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study rigorously evaluates whether proprietary large language models (LLMs) remain irreplaceable for automated essay scoring (AES), addressing critical concerns regarding performance, fairness, and cost-efficiency. Method: We systematically benchmark nine leading closed- and open-source LLMs—including GPT-4, Llama 3, and Qwen2.5—using few-shot prompting, semantic embedding analysis, ML-driven scoring regression, and disparate impact scoring to assess predictive accuracy, demographic fairness (by age and race), and generated essay quality. Contribution/Results: Empirical results show no statistically significant differences between top-tier open-source models (e.g., Llama 3, Qwen2.5) and GPT-4 in accuracy or fairness; moreover, their inference cost is only ~2.7% of GPT-4’s. This work provides the first empirical evidence that high-quality open-source LLMs can simultaneously achieve competitive accuracy, equitable outcomes, and operational affordability—establishing a foundational basis for scalable, inclusive deployment of AI in educational assessment.
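The disparate impact scoring mentioned above can be sketched as a ratio of favorable-outcome rates between demographic groups (a value near 1.0 indicates similar treatment). The function name, threshold, and group labels below are illustrative assumptions, not the paper's actual implementation:

```python
# Illustrative sketch of a disparate impact score: the ratio of
# favorable-outcome rates (e.g., essays scored at or above a passing
# threshold) for a protected group versus a reference group.

def disparate_impact(scores, groups, threshold, protected, reference):
    def favorable_rate(group):
        in_group = [s for s, g in zip(scores, groups) if g == group]
        return sum(s >= threshold for s in in_group) / len(in_group)
    return favorable_rate(protected) / favorable_rate(reference)

# Toy essay scores for two hypothetical demographic groups.
scores = [3, 4, 5, 2, 4, 5, 3, 4]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(disparate_impact(scores, groups, threshold=4,
                       protected="A", reference="B"))  # ≈ 0.67
```

Under the common four-fifths rule of thumb, a ratio below 0.8 would flag potential disparate impact; the paper's finding is that ratios for open and closed LLMs do not differ significantly.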

📝 Abstract
Closed large language models (LLMs) such as GPT-4 have set state-of-the-art results across a number of NLP tasks and have become central to NLP and machine learning (ML)-driven solutions. Closed LLMs' performance and wide adoption have sparked considerable debate about their accessibility in terms of availability, cost, and transparency. In this study, we perform a rigorous comparative analysis of nine leading LLMs, spanning closed, open, and open-source LLM ecosystems, across text assessment and generation tasks related to automated essay scoring. Our findings reveal that for few-shot learning-based assessment of human-generated essays, open LLMs such as Llama 3 and Qwen2.5 perform comparably to GPT-4 in terms of predictive performance, with no significant differences in disparate impact scores when considering age- or race-related fairness. Moreover, Llama 3 offers a substantial cost advantage, being up to 37 times more cost-efficient than GPT-4. For generative tasks, we find that essays generated by top open LLMs are comparable to closed LLMs in terms of their semantic composition/embeddings and ML-assessed scores. Our findings challenge the dominance of closed LLMs and highlight the democratizing potential of open LLMs, suggesting they can effectively bridge accessibility divides while maintaining competitive performance and fairness.
Problem

Research questions and friction points this paper is trying to address.

Compare performance of closed vs. open LLMs in essay scoring.
Assess fairness and cost-efficiency of open LLMs like Llama 3.
Evaluate generative capabilities of open LLMs versus closed LLMs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open LLMs match GPT-4 in essay scoring
Llama 3 offers 37x cost efficiency over GPT-4
Open LLMs maintain fairness and competitive performance
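The two cost figures on this page are consistent: "37 times more cost-efficient" corresponds to a relative inference cost of 1/37, i.e. the "~2.7% of GPT-4's" quoted in the summary. A one-line sanity check:

```python
# 37x cost efficiency implies a relative inference cost of 1/37 of GPT-4's.
relative_cost = 1 / 37
print(f"{relative_cost:.1%}")  # prints "2.7%"
```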
Kezia Oketch
Department of IT, Analytics, and Operations, University of Notre Dame
John P. Lalor
Department of IT, Analytics, and Operations, University of Notre Dame
Yi Yang
Department of Information Systems, Business Statistics and Operations Management, Hong Kong University of Science and Technology
Ahmed Abbasi
Giovanini Endowed Chair Professor, University of Notre Dame
Artificial Intelligence · Machine Learning · Natural Language Processing · Predictive Analytics