🤖 AI Summary
This study evaluates the immediate causal effects of on-demand human tutoring within an adaptive learning system, addressing estimation biases arising from student self-selection into help-seeking and dynamic knowledge states. By modeling students' latent proficiency using Deep Knowledge Tracing (DKT) and integrating it with a doubly robust causal forest estimator, the authors conduct a heterogeneous treatment effect analysis across more than 5,000 middle school mathematics tutoring sessions. This work presents the first integration of DKT with causal forests, enabling scalable, robust, and fine-grained causal inference. Results indicate that tutoring increases the probability of correctly answering the next problem by an average of 4 percentage points and improves subsequent skill accuracy by approximately 3 percentage points. Individual treatment effects are highly heterogeneous, ranging from −20.25 to +19.91 percentage points, with significantly larger benefits observed for students with lower prior mastery.
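The DKT component summarized above can be illustrated with a minimal sketch: an LSTM that consumes one-hot encoded (skill, correctness) pairs and emits per-skill probabilities of answering correctly at the next step. The class name `DKT`, the hidden size, and all dimensions below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DKT(nn.Module):
    """Minimal Deep Knowledge Tracing sketch (illustrative, not the paper's model)."""
    def __init__(self, n_skills: int, hidden: int = 64):
        super().__init__()
        # Input at each step: one-hot over 2 * n_skills
        # (skill id crossed with correct/incorrect)
        self.rnn = nn.LSTM(2 * n_skills, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_skills)  # per-skill mastery logits

    def forward(self, x):                  # x: (batch, time, 2 * n_skills)
        h, _ = self.rnn(x)
        return torch.sigmoid(self.out(h))  # (batch, time, n_skills) mastery probs

# Synthetic interaction sequences: which skill was attempted, and whether correct
n_skills, batch, steps = 10, 4, 20
skills = torch.randint(n_skills, (batch, steps))
correct = torch.randint(2, (batch, steps))
x = torch.zeros(batch, steps, 2 * n_skills)
x[torch.arange(batch)[:, None], torch.arange(steps),
  skills + n_skills * correct] = 1.0

mastery = DKT(n_skills)(x)  # latent mastery trajectory, one probability per skill
```

The final time-step of `mastery` is the kind of latent-proficiency estimate that the causal analysis conditions on.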
📝 Abstract
This paper introduces a scalable causal inference framework for estimating the immediate, session-level effects of on-demand human tutoring embedded within adaptive learning systems. Because students seek assistance at moments of difficulty, conventional evaluation is confounded by self-selection and time-varying knowledge states. We address these challenges by integrating principled analytic sample construction with Deep Knowledge Tracing (DKT) to estimate latent mastery, followed by doubly robust estimation using Causal Forests. Applying this framework to over 5,000 middle-school mathematics tutoring sessions, we find that requesting human tutoring increases next-problem correctness by approximately 4 percentage points and accuracy on the subsequent skill encountered by approximately 3 percentage points, suggesting proximal transfer of tutoring effects across knowledge components. These effects are robust across model specifications and to plausible levels of unmeasured confounding. Notably, they exhibit significant heterogeneity across sessions and students, with session-level effect estimates ranging from −20.25pp to +19.91pp. Our follow-up analyses suggest that typical behavioral indicators, such as student talk time, do not consistently correlate with high-impact sessions. Furthermore, treatment effects are larger for students with lower prior mastery and slightly smaller for low-SES students. This framework offers a rigorous, practical template for the evaluation and continuous improvement of on-demand human tutoring, with direct applications for emerging AI tutoring systems.
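The doubly robust step described in the abstract can be sketched with a simple AIPW (augmented inverse propensity weighting) estimator, which shares the debiasing logic underlying causal forests. Everything below is a toy illustration under assumed synthetic data, not the authors' pipeline: the `mastery` covariate stands in for DKT output, and the effect sizes are chosen to mimic the self-selection and heterogeneity the paper describes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
mastery = rng.uniform(0, 1, n)      # stand-in for DKT latent mastery
X = mastery.reshape(-1, 1)

# Self-selection: low-mastery students request tutoring more often
T = rng.binomial(1, 0.7 - 0.4 * mastery)
# Next-problem correctness: baseline rises with mastery; tutoring adds a
# small boost that is larger for low-mastery students (heterogeneity)
p = 0.3 + 0.5 * mastery + T * (0.04 + 0.06 * (1 - mastery))
Y = rng.binomial(1, np.clip(p, 0, 1)).astype(float)

# Propensity model e(x) = P(T = 1 | x)
e = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
# Outcome models mu_t(x) = E[Y | T = t, x], fit separately per arm
mu1 = LogisticRegression().fit(X[T == 1], Y[T == 1]).predict_proba(X)[:, 1]
mu0 = LogisticRegression().fit(X[T == 0], Y[T == 0]).predict_proba(X)[:, 1]

# AIPW pseudo-outcomes: consistent if either the propensity model
# or the outcome models are correctly specified ("doubly robust")
psi = (mu1 - mu0
       + T * (Y - mu1) / e
       - (1 - T) * (Y - mu0) / (1 - e))
ate = psi.mean()
print(f"Doubly robust ATE estimate: {ate:+.3f}")
```

In the paper's setting, a causal forest replaces the simple averaging of `psi`, fitting a forest to these debiased signals to recover session-level (heterogeneous) effect estimates rather than a single average.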