🤖 AI Summary
Current educational evaluations of large language models either focus on general correctness or rely on human scoring that is hard to scale, which makes it difficult to assess fine-grained pedagogical capabilities, such as scaffolding and values integration, in long-tail instructional scenarios. This work proposes the Elmes* framework, which combines a multi-agent teacher–student–judge interaction engine with a self-evolving SceneGen module to automatically construct and refine scenario-specific evaluation criteria. It presents the first automated, fine-grained scoring system tailored to long-tail educational contexts, enabling multidimensional diagnostic assessment, and introduces Edu-330, a benchmark of 330 scenarios spanning 11 subjects, 3 grade bands, and 10 task types. Experiments show that leading models differ mainly in creativity and values integration, and that the education-specialized model InnoSpark achieves the best human-evaluated average score; LLM-based judges preserve human-comparable rankings with much lower scoring variance, though they exhibit judge-specific biases such as self-preference.
📝 Abstract
Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly to long-tail pedagogical scenarios. We introduce Elmes*, an end-to-end framework for constructing, refining, and applying fine-grained scenario-specific rubrics. Elmes* combines a declarative multi-agent engine for teacher–student–judge interactions with SceneGen, a self-evolving module that co-optimizes evaluation criteria and test data from expert-defined pedagogical dimensions. Using Elmes*, we build Edu-330, covering 330 scenarios across 11 subjects, 3 grade bands, and 10 task types, with over 1,000 second-level indicators. Experiments on Edu-330 and four expert-authored gold-standard scenarios show that educational capability is multidimensional: top-tier LLMs differ mainly in creativity and values integration, knowledge-strong models may fail at Socratic scaffolding, and the education-specialized InnoSpark achieves the best human-evaluated average score. LLM judges preserve human-comparable rankings with much lower scoring variance, but exhibit judge-specific biases such as self-preference. Ablations show that expert-scored few-shot anchoring improves human–LLM alignment, while reasoning enforcement and greedy decoding are model-dependent. Elmes* thus provides scalable diagnostic infrastructure for pedagogically grounded LLM evaluation.