ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios

📅 2025-07-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language model (LLM) evaluations in education lack standardized, pedagogy-oriented metrics; mainstream benchmarks emphasize general intelligence rather than educational suitability. Method: We propose a configurable multi-agent dialogue evaluation framework that integrates a modular architecture with an LLM-as-a-Judge mechanism, enabling dynamic construction of diverse instructional scenarios and objective, fine-grained quantification of teaching behaviors. The framework incorporates a hybrid evaluation engine and an education-specific metric system. Contribution/Results: We systematically evaluate leading LLMs across four canonical teaching scenarios—revealing, for the first time, significant disparities in their capabilities across teaching planning, feedback generation, and cognitive scaffolding. This work bridges a critical gap in pedagogically grounded LLM assessment and substantially reduces the cost of model selection and pedagogical adaptation for educational applications.

📝 Abstract
The emergence of Large Language Models (LLMs) presents transformative opportunities for education, generating numerous novel application scenarios. However, significant challenges remain: evaluation metrics vary substantially across different educational scenarios, while many emerging scenarios lack appropriate assessment metrics. Current benchmarks predominantly measure general intelligence rather than pedagogical capabilities. To address this gap, we introduce ELMES, an open-source automated evaluation framework specifically designed for assessing LLMs in educational settings. ELMES features a modular architecture that enables researchers to create dynamic, multi-agent dialogues through simple configuration files, facilitating flexible scenario design without requiring extensive programming expertise. The framework incorporates a hybrid evaluation engine that objectively quantifies traditionally subjective pedagogical metrics using an LLM-as-a-Judge methodology. We conduct systematic benchmarking of state-of-the-art LLMs across four critical educational scenarios: Knowledge Point Explanation, Guided Problem-Solving Teaching, Interdisciplinary Lesson Plan Generation, and Contextualized Question Generation, employing fine-grained metrics developed in collaboration with education specialists. Our results demonstrate distinct capability distributions among models, revealing context-specific strengths and limitations. ELMES provides educators and researchers with an accessible evaluation framework that significantly reduces adaptation barriers for diverse educational applications while advancing the practical implementation of LLMs in pedagogy. The framework is publicly available at https://github.com/sii-research/elmes.git.
Problem

Research questions and friction points this paper is trying to address.

Lack of standardized evaluation metrics for LLMs in education
Current benchmarks focus on general intelligence, not teaching skills
Need for flexible framework to assess pedagogical capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular architecture for dynamic multi-agent dialogues
Hybrid evaluation engine with LLM-as-a-Judge
Fine-grained metrics for educational scenarios
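To make the LLM-as-a-Judge idea above concrete, the sketch below shows the generic pattern: build a rubric-based judging prompt, then parse the judge model's scored reply. This is an illustrative assumption, not ELMES's actual API; the rubric names, prompt wording, and 1-5 scale are hypothetical, and the real framework drives this through configuration files rather than hand-written code.

```python
import json
import re

# Hypothetical rubric for pedagogical criteria; names are illustrative
# assumptions, not taken from the ELMES codebase.
RUBRIC = {
    "teaching_planning": "Does the response lay out a coherent lesson sequence?",
    "feedback_quality": "Is the feedback specific and actionable for the learner?",
    "cognitive_scaffolding": "Does the response guide rather than reveal the answer?",
}

def build_judge_prompt(dialogue: str, rubric: dict) -> str:
    """Assemble a judging prompt that asks for 1-5 scores per criterion as JSON."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    return (
        "You are an expert pedagogy evaluator. Score the dialogue below on each "
        "criterion from 1 (poor) to 5 (excellent).\n"
        f"Criteria:\n{criteria}\n\n"
        f"Dialogue:\n{dialogue}\n\n"
        'Reply with JSON only, e.g. {"teaching_planning": 3}'
    )

def parse_scores(judge_reply: str, rubric: dict) -> dict:
    """Extract the JSON score object from a (possibly chatty) judge reply."""
    match = re.search(r"\{.*\}", judge_reply, re.DOTALL)
    scores = json.loads(match.group(0))
    # Keep only known criteria and clamp each score to the 1-5 scale.
    return {k: max(1, min(5, int(scores[k]))) for k in rubric if k in scores}

# The prompt would be sent to a judge LLM; here we parse a mock reply.
reply = 'Sure! {"teaching_planning": 4, "feedback_quality": 5, "cognitive_scaffolding": 2}'
print(parse_scores(reply, RUBRIC))
```

The clamp-and-filter step matters in practice: judge models occasionally return out-of-range scores or extra keys, and quantifying "traditionally subjective" metrics reliably depends on normalizing such replies before aggregation.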
Shou'ang Wei
Shanghai Institute of AI for Education, East China Normal University, Shanghai, 200062, China
Xinyun Wang
Shanghai Institute of AI for Education, East China Normal University, Shanghai, 200062, China
Shuzhen Bi
Shanghai Innovation Institute, Shanghai, 200231, China
Jian Chen
Shanghai Institute of AI for Education, East China Normal University, Shanghai, 200062, China
Ruijia Li
East China Normal University
Bo Jiang
Shanghai Institute of AI for Education, East China Normal University, Shanghai, 200062, China
Xin Lin
Shanghai Institute of AI for Education, East China Normal University, Shanghai, 200062, China
Min Zhang
Shanghai Institute of AI for Education, East China Normal University, Shanghai, 200062, China
Yu Song
Shanghai Institute of AI for Education, East China Normal University, Shanghai, 200062, China
BingDong Li
Shanghai Institute of AI for Education, East China Normal University, Shanghai, 200062, China
Aimin Zhou
Shanghai Innovation Institute, Shanghai, 200231, China; Shanghai Institute of AI for Education, East China Normal University, Shanghai, 200062, China; School of Computer Science and Technology, East China Normal University, Shanghai, 200062, China
Hao Hao
Shanghai Institute of AI for Education, East China Normal University, Shanghai, 200062, China