From Answers to Questions: EQGBench for Evaluating LLMs' Educational Question Generation

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit limited pedagogical value in educational question generation (EQG), and a rigorous, domain-specific evaluation framework, particularly for Chinese, remains absent. Method: We introduce EQGBench, the first Chinese multidisciplinary EQG benchmark, covering mathematics, physics, and chemistry with 900 authentic classroom-derived instances. We propose a novel five-dimensional evaluation framework assessing knowledge coverage, difficulty gradation, question-type diversity, pedagogical value, and holistic competency development. Leveraging user-query-driven generation augmented with fine-grained knowledge tagging and controlled difficulty scaling, we systematically evaluate 46 state-of-the-art LLMs. Contribution/Results: Our evaluation reveals substantial deficiencies in current LLMs' ability to generate pedagogically effective, competency-oriented educational questions. EQGBench fills a critical gap in systematic EQG assessment, providing a reproducible benchmark and concrete, actionable directions for advancing EQG model design and pedagogical alignment.

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in mathematical problem-solving. However, the transition from providing answers to generating high-quality educational questions presents significant challenges that remain underexplored. To advance Educational Question Generation (EQG) and facilitate LLMs in generating pedagogically valuable and educationally effective questions, we introduce EQGBench, a comprehensive benchmark specifically designed for evaluating LLMs' performance in Chinese EQG. EQGBench establishes a five-dimensional evaluation framework supported by a dataset of 900 evaluation samples spanning three fundamental middle school disciplines: mathematics, physics, and chemistry. The dataset incorporates user queries with varying knowledge points, difficulty gradients, and question type specifications to simulate realistic educational scenarios. Through systematic evaluation of 46 mainstream large models, we reveal significant room for development in generating questions that reflect educational value and foster students' comprehensive abilities.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to generate educational questions
Assessing pedagogical value in Chinese question generation
Measuring educational effectiveness across multiple disciplines
Innovation

Methods, ideas, or system contributions that make the work stand out.

EQGBench benchmark for Chinese EQG evaluation
Five-dimensional framework with 900 samples
Systematic evaluation of 46 mainstream LLMs on educational question quality
Authors
Chengliang Zhou
School of Artificial Intelligence, Beijing Normal University
Mei Wang
Beijing Normal University
Research interests: face recognition, fairness in AI, domain adaptation
Ting Zhang
School of Artificial Intelligence, Beijing Normal University
Qiannan Zhu
School of Artificial Intelligence, Beijing Normal University
Research interests: knowledge graph, recommendation system, information retrieval
Jian Li
School of Artificial Intelligence, Beijing Normal University
Hua Huang
School of Artificial Intelligence, Beijing Normal University