AI Summary
Clinical response exhibits natural variability, which impedes accurate discrimination between responders and non-responders, a key bottleneck in causal analysis of response heterogeneity. To address this, we propose the causal two-groups (C2G) model, which formalizes treatment response as a latent variable, and introduce two empirical Bayes procedures for it: one semi-parametric and one fully nonparametric. The nonparametric model is unidentifiable, so we define two new estimands and provide a strategy for deriving estimand intervals in that setting. Integrating causal inference, latent variable modeling, and false discovery rate (FDR) control, C2G controls the FDR at the target level while achieving near-optimal statistical power. Applied to cancer immunotherapy data, the nonparametric C2G model recovers clinically validated predictive biomarkers of both positive and negative outcomes. Both theoretical analysis and empirical evaluation support these FDR and power properties.
Abstract
Scientists often need to analyze the samples in a study that responded to treatment in order to refine their hypotheses and find potential causal drivers of response. Natural variation in outcomes makes teasing apart responders from non-responders a statistical inference problem. To handle latent responses, we introduce the causal two-groups (C2G) model, a causal extension of the classical two-groups model. The C2G model posits that treated samples may or may not experience an effect, according to some prior probability. We propose two empirical Bayes procedures for the causal two-groups model, one under semi-parametric conditions and another under fully nonparametric conditions. The semi-parametric model assumes additive treatment effects and is identifiable from observed data. The nonparametric model is unidentifiable, but we show it can still be used to test for response in each treated sample. We show empirically and theoretically that both methods for selecting responders control the false discovery rate at the target level with near-optimal power. We also propose two novel estimands of interest and provide a strategy for deriving estimand intervals in the unidentifiable nonparametric model. On a cancer immunotherapy dataset, the nonparametric C2G model recovers clinically-validated predictive biomarkers of both positive and negative outcomes. Code is available at https://github.com/tansey-lab/causal2groups.
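To make the selection step concrete, the sketch below shows the standard way a two-groups posterior is turned into an FDR-controlled selection: rank samples by their posterior probability of being a non-responder (the local FDR) and keep the largest prefix whose average local FDR stays at or below the target level. This is a generic illustration, not the authors' implementation; the function name `select_responders` and the input `post_prob` (assumed to be posterior response probabilities already produced by some fitted model) are hypothetical.

```python
import numpy as np


def select_responders(post_prob, alpha=0.1):
    """Select samples so the estimated Bayesian FDR is at most alpha.

    post_prob: hypothetical per-sample posterior probabilities of response,
               assumed to come from an already fitted two-groups model.
    Returns the sorted indices of the selected samples.
    """
    post_prob = np.asarray(post_prob, dtype=float)
    local_fdr = 1.0 - post_prob  # posterior probability of no response
    order = np.argsort(local_fdr)  # most confident responders first
    # Average local FDR of the top-k samples estimates the FDR of selecting them.
    running_fdr = np.cumsum(local_fdr[order]) / np.arange(1, len(order) + 1)
    passing = np.nonzero(running_fdr <= alpha)[0]
    if passing.size == 0:
        return np.array([], dtype=int)
    k = passing[-1] + 1  # largest prefix that meets the target level
    return np.sort(order[:k])


selected = select_responders([0.99, 0.95, 0.5, 0.9, 0.2], alpha=0.1)
print(selected)
```

Because the local FDR values are sorted in increasing order, the running average never decreases, so the last passing prefix is also the largest valid selection.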