🤖 AI Summary
This study addresses a key limitation in causal formulations of differential item functioning (DIF) analysis. These formulations treat non-manipulable group attributes, such as gender or race, as the treatment variable, which obscures the underlying causal mechanisms and hinders targeted fairness interventions. Drawing on an interventionist causal framework, the work brings Robins and Richardson's (2010) treatment decomposition approach to educational assessment. It reframes non-manipulable attributes through actionable intervening components, such as English vocabulary unfamiliarity or classroom learning barriers. The authors formally define separable DIF effects, with and without item impact, and give identification strategies for each. They apply the framework to biased items in the SAT and the New York State Regents exams. They also propose detection methods based on causal machine learning (causal forests and Bayesian additive regression trees) and evaluate their performance in a simulation study.
📝 Abstract
Differential item functioning (DIF) is a widely used statistical notion for identifying items that may disadvantage specific groups of test-takers. These groups are often defined by non-manipulable characteristics, e.g., gender, race/ethnicity, or English-language learner (ELL) status. DIF can be framed as a causal fairness problem by treating group membership as the treatment variable, but doing so invokes the long-standing controversy over how to interpret causal effects of non-manipulable treatments. To better identify and interpret causal sources of DIF, this study adopts the interventionist approach based on the treatment decomposition proposed by Robins and Richardson (2010). Under this framework, a non-manipulable treatment can be decomposed into intervening variables. For example, ELL status can be decomposed into English vocabulary unfamiliarity and classroom learning barriers, each of which influences the outcome through a different causal pathway. We formally define separable DIF effects associated with these decomposed components, in both the absence and the presence of item impact, and provide causal identification strategies for each effect. We then apply the framework to biased test items in the SAT and Regents exams. We also develop formal detection procedures based on causal machine learning, namely causal forests and Bayesian additive regression trees, and demonstrate their performance through a simulation study. Finally, we discuss the implications of adopting interventionist approaches in educational testing practice.
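To make the idea of separable DIF effects concrete, the following is a minimal toy sketch. It is not the authors' estimation method; the paper uses causal forests and BART for detection. The sketch assumes a hypothetical data-generating process. ELL status is split into two binary components, `a_v` (vocabulary unfamiliarity) and `a_c` (classroom learning barriers). The item follows a Rasch-type logistic response model with made-up penalties `GAMMA_V` and `GAMMA_C`. Ability `theta ~ N(0, 1)` is drawn independently of group, which corresponds to the no-item-impact case. Because the toy model is known, the code evaluates each potential outcome `E[Y(a_v, a_c)]` directly. The total DIF effect then splits into a vocabulary-pathway effect and a classroom-pathway effect.

```python
"""Toy illustration of separable DIF effects under a HYPOTHETICAL model.

Assumptions (illustrative only, not taken from the paper):
- ELL status is decomposed into binary components a_v (English vocabulary
  unfamiliarity) and a_c (classroom learning barriers).
- The item follows a Rasch-type logistic model with made-up parameters.
- Ability theta ~ N(0, 1) is independent of group (no item impact).
"""
import math
import random

# Hypothetical item parameters: difficulty and the log-odds penalty
# attributed to each decomposed component of ELL status.
B = 0.0
GAMMA_V = 0.8  # penalty operating through the vocabulary pathway
GAMMA_C = 0.3  # penalty operating through the classroom-barrier pathway


def p_correct(theta: float, a_v: int, a_c: int) -> float:
    """P(Y = 1 | theta) when the two components are set to (a_v, a_c)."""
    logit = theta - B - GAMMA_V * a_v - GAMMA_C * a_c
    return 1.0 / (1.0 + math.exp(-logit))


def mean_potential_outcome(thetas: list[float], a_v: int, a_c: int) -> float:
    """Monte Carlo estimate of E[Y(a_v, a_c)], averaging over ability draws."""
    return sum(p_correct(t, a_v, a_c) for t in thetas) / len(thetas)


def separable_dif_effects(n: int = 100_000, seed: int = 0) -> dict[str, float]:
    """Total DIF effect and its split into two pathway-specific effects."""
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y00 = mean_potential_outcome(thetas, 0, 0)  # non-ELL: neither component
    y10 = mean_potential_outcome(thetas, 1, 0)  # vocabulary component only
    y11 = mean_potential_outcome(thetas, 1, 1)  # ELL: both components
    return {
        "total": y11 - y00,       # full change in ELL status
        "vocabulary": y10 - y00,  # switch only the vocabulary component
        "classroom": y11 - y10,   # then switch the classroom component
    }


if __name__ == "__main__":
    for name, value in separable_dif_effects().items():
        print(f"{name:>10}: {value:+.4f}")
```

Because the response model is nonlinear, the two pathway effects depend on the order in which the components are switched. In this sketch, the vocabulary effect is evaluated with `a_c = 0` and the classroom effect with `a_v = 1`, so they sum exactly to the total. With observed data, only the `(0, 0)` and `(1, 1)` arms are seen. Recovering `E[Y(1, 0)]` then requires the identification assumptions developed in the paper.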