Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

📅 2026-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current educational evaluations of large language models are largely confined to general correctness or rely on human scoring that is difficult to scale, which makes it hard to assess fine-grained pedagogical capabilities, such as scaffolding and values integration, in long-tail instructional scenarios. This work proposes the Elmes* framework, which uses a multi-agent teacher–student–evaluator interaction engine and a self-evolving SceneGen module to automatically construct and refine scenario-specific evaluation criteria. It presents the first automated, fine-grained scoring system tailored for long-tail educational contexts, enabling multidimensional diagnostic assessment and introducing Edu-330, a benchmark of 330 scenarios across 11 subjects, 3 grade bands, and 10 task types. Experiments show that leading models differ mainly in creativity and values integration, with the education-specialized model InnoSpark achieving the best human-evaluated average score; furthermore, LLM-based evaluators yield rankings consistent with human judgments while exhibiting lower scoring variance, though they show judge-specific biases such as self-preference.
📝 Abstract
Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly to long-tail pedagogical scenarios. We introduce Elmes*, an end-to-end framework for constructing, refining, and applying fine-grained scenario-specific rubrics. Elmes* combines a declarative multi-agent engine for teacher-student-judge interactions with SceneGen, a self-evolving module that co-optimizes evaluation criteria and test data from expert-defined pedagogical dimensions. Using Elmes*, we build Edu-330, covering 330 scenarios across 11 subjects, 3 grade bands, and 10 task types, with over 1,000 second-level indicators. Experiments on Edu-330 and four expert-authored gold-standard scenarios show that educational capability is multidimensional: top-tier LLMs differ mainly in creativity and values integration, knowledge-strong models may fail at Socratic scaffolding, and the education-specialized InnoSpark achieves the best human-evaluated average score. LLM judges preserve human-comparable rankings with much lower scoring variance, but exhibit judge-specific biases such as self-preference. Ablations show that expert-scored few-shot anchoring improves human-LLM alignment, while reasoning enforcement and greedy decoding are model-dependent. Elmes* thus provides scalable diagnostic infrastructure for pedagogically grounded LLM evaluation.
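The reported Edu-330 counts are consistent with a full grid, since 11 × 3 × 10 = 330. The sketch below checks that arithmetic under an assumption the abstract does not state: that the benchmark holds exactly one scenario per (subject, grade band, task type) combination. All labels are hypothetical placeholders, because the abstract gives only the counts.

```python
from itertools import product

# Hypothetical placeholder labels; the source gives only the counts.
subjects = [f"subject_{i}" for i in range(1, 12)]        # 11 subjects
grade_bands = [f"grade_band_{i}" for i in range(1, 4)]   # 3 grade bands
task_types = [f"task_type_{i}" for i in range(1, 11)]    # 10 task types

# Assumption: one scenario per (subject, grade band, task type) combination.
scenario_grid = list(product(subjects, grade_bands, task_types))

print(len(scenario_grid))  # 330
```

If the real benchmark is not a full grid, the total would still need to come out to 330, but the per-cell counts would differ.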
Problem

Research questions and friction points this paper is trying to address.

large language models
educational evaluation
fine-grained rubrics
long-tail scenarios
pedagogical assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained evaluation rubrics
multi-agent educational evaluation
self-evolving evaluation framework
pedagogically grounded LLM assessment
long-tail educational scenarios
Tao Liu
Shanghai Institute of AI for Education, East China Normal University, Shanghai 200062, China; School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
Ye Lu
Shanghai Institute of AI for Education, East China Normal University, Shanghai 200062, China; School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
Ruohua Zhang
Shanghai Institute of AI for Education, East China Normal University, Shanghai 200062, China; School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
Siyu Song
Shanghai Institute of AI for Education, East China Normal University, Shanghai 200062, China; School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
Wentao Liu
School of Artificial Intelligence, Beijing University of Posts and Telecommunications,
Medical image analysis, Surgical navigation
Aimin Zhou
Shanghai Institute of AI for Education, East China Normal University, Shanghai 200062, China; School of Computer Science and Technology, East China Normal University, Shanghai 200062, China; Shanghai Innovation Institute, Shanghai 200231, China
Hao Hao
East China Normal University