Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage

📅 2026-06-10

🤖 AI Summary

This work addresses the lack of procedural reasoning benchmarks for AI-supported learning systems that combine natural, learner-like phrasing, traceable knowledge grounding, and multi-hop reasoning. The authors study question generation built on the Task-Method-Knowledge (TMK) model, comparing strict TMK generation, transcript-first generation with post-hoc TMK filtering, and TMK-aware generation that combines transcripts with structured guidance. Generated items are validated against closed-set evidence units extracted from the TMK models. Across 23 instructional topics and 690 question-answer pairs, strict TMK generation achieves the strongest overall quality, with 96.5% grounded questions and 92.6% usable questions. Transcript-first generation yields more learner-like questions but more context-dependent or weakly grounded items, and TMK-aware generation yields high raw multi-hop coverage but lower grounding.

📝 Abstract

Evaluating procedural reasoning in AI-supported learning systems requires question-answer datasets that are both learner-like and grounded in the instructional knowledge the system is expected to use. We study how question generation strategies built on Task-Method-Knowledge (TMK) models affect dataset quality for procedural and multi-hop reasoning. We compare three strategies: strict generation from TMK models, transcript-first generation with post-hoc TMK filtering, and TMK-aware generation that combines transcripts with structured guidance. To evaluate generated items, we introduce a grounding validation framework based on closed-set evidence units extracted from TMK models. The framework measures whether answers are supported by the underlying representation, whether questions are self-contained, and whether they target multi-hop procedural reasoning. Across 23 instructional topics and 690 generated question-answer pairs, strict TMK generation achieves the strongest overall quality, with 96.5% grounded questions and 92.6% usable questions. Transcript-first generation produces more learner-like questions but also more context-dependent or weakly grounded items, while TMK-aware generation yields high raw multi-hop coverage but lower grounding. These results show that procedural richness and natural phrasing do not guarantee representational grounding, motivating explicit representation-aware validation for evaluation datasets in AI-supported learning.
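
To make the three validation checks concrete, the following is a minimal sketch of what a closed-set grounding check might look like. It is not the authors' implementation: the `EvidenceUnit` and `QAItem` structures, the `CONTEXT_MARKERS` phrase list, the two-unit threshold for multi-hop, and the definition of "usable" as grounded plus self-contained are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvidenceUnit:
    """One closed-set unit extracted from a TMK model (hypothetical shape)."""
    unit_id: str
    kind: str  # assumed to be "task", "method", or "knowledge"
    text: str


@dataclass
class QAItem:
    """A generated question-answer pair plus the unit ids it claims to rely on."""
    question: str
    answer: str
    cited_units: list[str] = field(default_factory=list)


# Assumed phrases that signal a question depends on outside context.
CONTEXT_MARKERS = ("in the video", "as mentioned", "the instructor", "this lecture")


def validate(item: QAItem, evidence: dict[str, EvidenceUnit]) -> dict[str, bool]:
    """Apply three illustrative checks against a closed evidence set."""
    known = [u for u in item.cited_units if u in evidence]
    # Grounded: at least one unit is cited and every cited unit is in the closed set.
    grounded = bool(item.cited_units) and len(known) == len(item.cited_units)
    # Self-contained: the question avoids phrases that point outside itself.
    question = item.question.lower()
    self_contained = not any(marker in question for marker in CONTEXT_MARKERS)
    # Multi-hop (assumed proxy): a grounded item that draws on two or more distinct units.
    multi_hop = grounded and len(set(known)) >= 2
    return {
        "grounded": grounded,
        "self_contained": self_contained,
        "multi_hop": multi_hop,
        # "Usable" is an assumed combination; the source does not define it.
        "usable": grounded and self_contained,
    }


evidence = {
    "t1": EvidenceUnit("t1", "task", "Sort a list"),
    "m1": EvidenceUnit("m1", "method", "Merge sort splits, sorts, and merges"),
}
good = QAItem("Why does merge sort split the list before merging?", "...", ["t1", "m1"])
ungrounded = QAItem("What is quicksort's pivot?", "...", ["m9"])
contextual = QAItem("What did the instructor say about merging?", "...", ["m1"])
```

Because the evidence set is closed, an item that cites a unit outside it (such as `m9` here) fails the grounding check regardless of how natural or procedurally rich the question reads, which mirrors the abstract's point that phrasing quality does not guarantee grounding.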

Problem

Research questions and friction points this paper is trying to address.

procedural reasoning

evaluation datasets

grounding

multi-hop reasoning

AI-supported learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

procedural reasoning

grounding validation

TMK-based question generation