🤖 AI Summary
This study investigates whether causal relationships among genes can be recovered from aggregated bulk gene expression data. The authors formalize recoverability under aggregation through two criteria, functional-form consistency and conditional-independence consistency, and derive necessary and sufficient conditions for them: both properties are preserved only when the aggregation is linear (e.g., sum or mean) and the structural equations are affine. To assess how plausible these conditions are in practice, the authors analyze four bulk and four single-cell gene expression datasets and find that the estimated pairwise regulatory functions among genes deviate from linearity in both data types. The results point to a fundamental limitation of causal inference methods that rely on bulk expression data without strong additional assumptions.
📝 Abstract
Bulk gene expression profiling, which aggregates pooled RNA across cells within a biological sample, remains important in the single-cell era because it is typically less noisy, more sensitive, and more cost-effective than single-cell assays. Accordingly, a growing body of computational methods seeks to recover causal relations among genes from bulk expression data. However, aggregation is a lossy, non-invertible coarsening of the underlying cellular system, and it remains unclear whether and under what conditions causal relations are recoverable from aggregated bulk gene expression data. To answer this, we formalize recoverability under aggregation through two notions of consistency: functional-form consistency and conditional-independence consistency. We then derive necessary and sufficient conditions for recoverability, showing that these properties are preserved only under linear aggregations (e.g., sum/mean) coupled with affine structural equations. To assess the practical plausibility of these conditions, analyses of four bulk and four single-cell gene expression datasets further reveal that the estimated pairwise regulatory functions among genes deviate from linearity in both data types, providing limited empirical support for the linearity assumptions required for recoverability. Together, these results caution against recovering causal relations from aggregated bulk expression data without strong additional assumptions.
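The intuition behind the linear-aggregation and affine-equation condition can be illustrated with a toy example. The sketch below is not the paper's proof or method. It assumes a noise-free, two-gene system with a fixed number of cells per sample, and the names `affine`, `quadratic`, `bulk_y`, and the two samples are hypothetical. It shows that under sum aggregation an affine cell-level mechanism yields a bulk output that depends only on the bulk input, whereas a nonlinear mechanism does not.

```python
# Toy illustration (assumed setup, not taken from the paper):
# each cell has expression x of gene X, and gene Y is produced as y = f(x).
# Bulk profiling observes only the sum over cells.

def aggregate(values):
    """Sum aggregation across the cells of one sample."""
    return sum(values)

def affine(x, a=2.0, b=1.0):
    """Hypothetical affine cell-level mechanism y = a*x + b."""
    return a * x + b

def quadratic(x):
    """Hypothetical nonlinear cell-level mechanism y = x^2."""
    return x * x

def bulk_y(sample, f):
    """Bulk expression of Y: apply f in every cell, then aggregate."""
    return aggregate([f(x) for x in sample])

# Two samples with the same bulk X (4.0) but different per-cell distributions.
sample_a = [1.0, 1.0, 1.0, 1.0]
sample_b = [0.0, 0.0, 2.0, 2.0]

# Affine mechanism: bulk Y is a*bulk_X + n*b, so equal bulk X gives equal bulk Y.
affine_a = bulk_y(sample_a, affine)      # 12.0
affine_b = bulk_y(sample_b, affine)      # 12.0

# Nonlinear mechanism: bulk Y also depends on how X is spread across cells.
quad_a = bulk_y(sample_a, quadratic)     # 4.0
quad_b = bulk_y(sample_b, quadratic)     # 8.0

print(affine_a, affine_b, quad_a, quad_b)
```

In the quadratic case the two samples are indistinguishable at the bulk level in X yet differ in bulk Y, so no single function of bulk X can reproduce the cell-level relation. This is the kind of failure that functional-form consistency is meant to rule out.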