Automatic Generation of Inference Making Questions for Reading Comprehension Assessments

📅 2025-06-09
📈 Citations: 0
Influential Citations: 0
🤖 AI Summary
This study addresses the high manual-construction cost and uneven coverage of inference-type items in diagnostic reading comprehension assessments. The authors propose a human-in-the-loop automated generation method grounded in a new taxonomy of inference types for reading comprehension. Using GPT-4o with few-shot prompting, compared with and without chain-of-thought instructions, they generate diagnostic questions that require cross-sentence coreference resolution or the integration of external knowledge, targeted at grades 3-12. A three-dimensional human evaluation, covering overall item quality, alignment with the targeted inference type, and plausibility of the LLM's reasoning, demonstrates feasibility: 93.8% of generated items meet operational usability standards, with inter-annotator reliability above 0.90. Although only 42.6% of items precisely match the intended inference type, the results support this paradigm as a scalable path to high-quality diagnostic item generation.

📝 Abstract
Inference making is an essential but complex skill in reading comprehension (RC). Some inferences require resolving references across sentences, and some rely on prior knowledge to fill in details that are not explicitly stated in the text. Diagnostic RC questions can help educators provide more effective and targeted reading instruction and interventions for school-age students. We introduce a taxonomy of inference types for RC and use it to analyze the distribution of items within a diagnostic RC item bank. Next, we present experiments using GPT-4o to generate bridging-inference RC items for given reading passages via few-shot prompting, comparing conditions with and without chain-of-thought prompts. Generated items were evaluated on three aspects: overall item quality, appropriate inference type, and LLM reasoning, achieving inter-rater agreement above 0.90. Our results show that 93.8% of the questions GPT-4o produced were of good quality and suitable for operational use in grade 3-12 contexts; however, only 42.6% of the generated questions accurately matched the targeted inference type. We conclude that combining automatic item generation with human judgment offers a promising path toward scalable, high-quality diagnostic RC assessments.
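
For context on the generation setup, here is a minimal sketch of few-shot prompting with an optional chain-of-thought instruction through the OpenAI Python API. The passage, exemplar items, prompt wording, and the `generate_item` helper are hypothetical illustrations, not the paper's actual prompts, exemplars, or decoding settings.

```python
# Minimal sketch: few-shot generation of a bridging-inference RC item with GPT-4o.
# The passage, exemplars, and prompt text below are illustrative placeholders,
# not the prompts used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT_EXAMPLES = """\
Passage: Maria left her umbrella at home. By noon, her coat was soaked.
Question: Why was Maria's coat soaked?
Inference type: bridging (prior knowledge: rain makes things wet)

Passage: The dog barked at the mailman. He dropped the package and ran.
Question: Who dropped the package?
Inference type: bridging (cross-sentence coreference: "he" = the mailman)
"""

def generate_item(passage: str, use_cot: bool = True) -> str:
    """Ask GPT-4o for one bridging-inference question about `passage`."""
    # With use_cot=True the model is told to explain its inference first,
    # mirroring the paper's chain-of-thought condition; the returned reasoning
    # is the kind of output human raters judged for plausibility.
    cot = ("First explain, step by step, which inference the reader must make; "
           "then write the question.\n") if use_cot else ""
    prompt = (
        "You write diagnostic reading-comprehension items for grades 3-12.\n"
        "Each question must require a bridging inference: either resolving a "
        "reference across sentences or applying prior knowledge.\n\n"
        f"{FEW_SHOT_EXAMPLES}\n{cot}Passage: {passage}\nQuestion:"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_item("Ben studied all night. The next morning he aced the exam."))
```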
Problem

Research questions and friction points this paper is trying to address.

Generating inference-making questions for reading comprehension assessments
Ensuring questions match targeted inference types accurately
Combining automatic generation with human judgment for scalable assessments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using GPT-4o for automatic question generation
Few-shot prompting with chain-of-thought
Combining AI generation with human evaluation
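
On the human-evaluation side, the paper reports inter-rater agreement above 0.90, though the abstract does not name the statistic used. Below is a minimal sketch assuming Cohen's kappa over two annotators' item-quality labels, with invented labels purely for illustration.

```python
# Sketch of an inter-rater agreement check; the labels are invented and
# Cohen's kappa is an assumption (the abstract does not name the statistic).
from sklearn.metrics import cohen_kappa_score

# Hypothetical item-quality judgments ("good" / "bad") from two annotators.
rater_a = ["good", "good", "bad", "good", "good", "bad", "good", "good"]
rater_b = ["good", "good", "bad", "good", "bad", "bad", "good", "good"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values above 0.90 indicate near-perfect agreement
```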
🔎 Similar Papers
No similar papers found.