EduIllustrate: Towards Scalable Automated Generation Of Multimodal Educational Content

πŸ“… 2026-04-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This study addresses the lack of systematic evaluation of large language models’ (LLMs’) capacity to generate multimodal instructional content for K–12 STEM education. The authors construct a benchmark comprising 230 items spanning five disciplines and three grade bands, and propose the first standardized evaluation framework for educational multimodal content generation. Their approach employs a sequential anchoring protocol to ensure geometric consistency between text and images as well as coherent reasoning, and introduces an eight-dimensional scoring rubric grounded in multimedia learning theory. High reliability on objective dimensions is demonstrated through validation by both human experts and LLM-as-judge. Experiments reveal that Gemini 3.0 Pro Preview achieves the highest performance (87.8%), while Kimi-K2.5 offers the best cost efficiency (80.8% at $0.12 per item). Sequential anchoring improves visual consistency by 13% and reduces generation costs by 94%.
πŸ“ Abstract
Large language models are increasingly used as educational assistants, yet evaluation of their educational capabilities remains concentrated on question-answering and tutoring tasks. A critical gap exists for multimedia instructional content generation -- the ability to produce coherent, diagram-rich explanations that combine geometrically accurate visuals with step-by-step reasoning. We present EduIllustrate, a benchmark for evaluating LLMs on interleaved text-diagram explanation generation for K-12 STEM problems. The benchmark comprises 230 problems spanning five subjects and three grade levels, a standardized generation protocol with sequential anchoring to enforce cross-diagram visual consistency, and an 8-dimension evaluation rubric grounded in multimedia learning theory covering both text and visual quality. Evaluation of ten LLMs reveals a wide performance spread: Gemini 3.0 Pro Preview leads at 87.8%, while Kimi-K2.5 achieves the best cost-efficiency (80.8% at $0.12/problem). Workflow ablation confirms sequential anchoring improves Visual Consistency by 13% at 94% lower cost. Human evaluation with 20 expert raters validates LLM-as-judge reliability for objective dimensions (ρ ≥ 0.83) while revealing limitations on subjective visual assessment.
Problem

Research questions and friction points this paper is trying to address.

multimodal educational content
diagram-rich explanations
LLM evaluation
text-diagram generation
K-12 STEM
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal educational content
sequential anchoring
visual consistency
LLM evaluation benchmark
diagram-rich explanation
πŸ”Ž Similar Papers
No similar papers found.
Shuzhen Bi
Shanghai Innovation Institute, University of Science and Technology of China
Mingzi Zhang
East China Normal University
Zhuoxuan Li
East China Normal University
Xiaolong Wang
East China Normal University
Keqian Li
GenAI, Meta
Data mining · Machine learning
Aimin Zhou
Shanghai Innovation Institute, East China Normal University