AI Summary
This work addresses the challenges of low sample efficiency and reward function design in deep reinforcement learning for intelligent tutoring systems, which hinder effective modeling of students' dynamically evolving learning strategies. To overcome these limitations, the authors propose THEMES, a framework based on generalized apprenticeship learning that introduces, for the first time, a time-varying multidimensional reward function. By leveraging only 18 expert demonstration trajectories from historical semester data, THEMES jointly integrates inverse reinforcement learning and policy generalization to accurately capture the temporal complexity and non-stationarity of teaching strategies. Evaluated on the task of predicting subsequent-semester instructional decisions, the method achieves an AUC of 0.899 and a Jaccard index of 0.653, significantly outperforming six state-of-the-art baselines and demonstrating strong efficacy and generalization capability.
Abstract
Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) have advanced rapidly in recent years and have been successfully applied to e-learning environments such as intelligent tutoring systems (ITSs). Despite this success, the broader application of DRL to educational technologies has been limited by major challenges, including sample inefficiency and the difficulty of designing reward functions. In contrast, Apprenticeship Learning (AL) uses a few expert demonstrations to infer the expert's underlying reward functions and derive decision-making policies that generalize and replicate optimal behavior. In this work, we leverage a generalized AL framework, THEMES, to induce effective pedagogical policies by capturing the complexities of the expert student learning process, where multiple reward functions may dynamically evolve over time. We evaluate THEMES against six state-of-the-art baselines and show that it achieves high performance, with an AUC of 0.899 and a Jaccard index of 0.653, using only 18 trajectories from a previous semester to predict student pedagogical decisions in a later semester, highlighting its potential as a powerful alternative for inducing effective pedagogical policies.
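The core idea behind AL that the abstract describes, namely inferring a reward function from a handful of expert demonstrations, can be illustrated with a minimal sketch. This is not the THEMES algorithm itself (which uses time-varying multidimensional rewards); it is a classic feature-expectation-matching step in the style of Abbeel and Ng's apprenticeship learning, with hypothetical toy trajectories and a one-hot state featurization invented for illustration:

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.9):
    """Empirical discounted feature expectations: mu = mean over
    trajectories of sum_t gamma^t * phi(s_t)."""
    mu = np.zeros_like(phi(trajectories[0][0]))
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(trajectories)

# Hypothetical featurization: one-hot over 3 discrete states.
def phi(s):
    v = np.zeros(3)
    v[s] = 1.0
    return v

# Toy data: the "expert" tends to reach state 2; a candidate
# policy's rollouts mostly stay near state 0.
expert_trajs = [[0, 1, 2], [0, 2, 2]]
policy_trajs = [[0, 0, 0], [0, 1, 0]]

mu_E = feature_expectations(expert_trajs, phi)
mu_pi = feature_expectations(policy_trajs, phi)

# One projection-style update: reward weights point from the
# candidate policy's feature expectations toward the expert's,
# yielding a linear reward R(s) = w . phi(s).
w = mu_E - mu_pi
w /= np.linalg.norm(w)
reward = lambda s: float(w @ phi(s))
```

Under these assumptions, states the expert visits more often than the candidate policy receive higher inferred reward, which is the signal an AL method then uses to improve the policy toward expert behavior.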