🤖 AI Summary
High-quality instructional dialogue data are critically scarce due to privacy sensitivities and the high cost of expert involvement. To address this, we propose an expert-participatory generation framework: large language models (LLMs) simulate novice teachers with diverse personality profiles and engage in multi-turn, realistic pedagogical dialogues with human education experts, while a personality modulation mechanism and real-time expert feedback loops ensure ecological validity and privacy preservation. Using this framework, we construct a high-fidelity dataset and instruction-tune LLaMA on it. The resulting expert model significantly outperforms GPT-4o in instructional relevance, cognitive depth, and reflective questioning, as validated by expert evaluation. This work introduces the first LLM-driven “personified novice” simulation paradigm, enabling a scalable, high-fidelity, and ethically grounded training-data infrastructure for educational AI systems.
📝 Abstract
High-quality, multi-turn instructional dialogues between novices and experts are essential for developing AI systems that support teaching, learning, and decision-making. These dialogues often involve scaffolding -- the process by which an expert supports a novice's thinking through questions, feedback, and step-by-step guidance. However, such data are scarce due to privacy concerns around recording and the vulnerability inherent in help-seeking. We present SimInstruct, a scalable, expert-in-the-loop tool for collecting scaffolding dialogues. Using teaching-development coaching as an example domain, SimInstruct simulates novice instructors via LLMs, varying both their teaching challenges and their persona traits, while human experts provide multi-turn feedback, reasoning, and instructional support. This design enables the creation of realistic, pedagogically rich dialogues without requiring real novice participants. Our results reveal that persona traits, such as extroversion and introversion, meaningfully influence how experts engage. Compared to real mentoring recordings, SimInstruct dialogues demonstrate comparable pedagogical relevance and cognitive depth. Experts also reported the process as engaging and reflective, improving both data quality and their own professional insight. We further fine-tuned a LLaMA model on the augmented dataset to serve as an expert model; it outperformed GPT-4o in instructional quality. Our analysis highlights GPT-4o's limitations: weak reflective questioning, overuse of generic praise, a condescending tone, and a tendency to overwhelm novices with excessive suggestions.
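The expert-in-the-loop collection loop described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration, not the paper's actual implementation: the persona list, challenge list, prompt wording, and function names (`build_novice_prompt`, `simulate_dialogue`) are all assumptions, and the LLM call is abstracted behind a plain callable so the sketch runs without any API.

```python
# Hypothetical sketch of a SimInstruct-style collection loop: an LLM role-plays
# a persona-conditioned novice instructor, and a human expert answers each turn.
# All names and prompt text here are illustrative, not from the paper.

PERSONAS = ["extroverted", "introverted"]
CHALLENGES = ["low student engagement", "pacing a dense syllabus"]

def build_novice_prompt(persona: str, challenge: str) -> str:
    """Compose the system prompt that makes the LLM role-play a novice."""
    return (
        f"You are a novice instructor with an {persona} personality. "
        f"You are seeking coaching about: {challenge}. "
        "Stay in character; ask for help rather than give it."
    )

def simulate_dialogue(llm, expert_reply_fn, persona, challenge, max_turns=3):
    """Alternate simulated-novice turns with human-expert turns.

    `llm` and `expert_reply_fn` are callables taking the dialogue history
    and returning the next utterance (the latter backed by a real expert).
    """
    history = [{"role": "system", "content": build_novice_prompt(persona, challenge)}]
    for _ in range(max_turns):
        novice_turn = llm(history)                 # LLM plays the novice
        history.append({"role": "novice", "content": novice_turn})
        expert_turn = expert_reply_fn(history)     # human expert in the loop
        history.append({"role": "expert", "content": expert_turn})
    return history
```

With stub callables in place of the LLM and the expert, a 3-turn run yields a 7-message history (one system prompt plus three novice/expert pairs), which is the kind of transcript that would then be pooled into the fine-tuning dataset.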