🤖 AI Summary
This study addresses the combinatorial explosion of conditions in single-cell perturbation experiments, which makes exhaustive experimental mapping of cellular phenotypes infeasible. The authors propose the single-cell Concept Bottleneck Generative Model (scCBGM), which adapts concept bottleneck architectures to single-cell data. Using decoder skip connections and a cross-covariance penalty, scCBGM promotes disentangled representations without dimensional constraints, and it extends to flow matching models for concept-guided counterfactual editing in both encoding-decoding and generation regimes. Across multiple real datasets, the method shows superior combinatorial generalization and counterfactual prediction, validated at the cell level on a synthetic benchmark with ground-truth counterfactuals and at the population level on real experimental data.
📝 Abstract
Understanding cellular phenotypes and how they respond to perturbations is critical for disease biology and therapeutic design. Single-cell RNA sequencing enables characterization at cellular resolution, yet the combinatorial space of conditions makes exhaustive experimental mapping infeasible. We introduce single-cell Concept Bottleneck Generative Models (scCBGM), a framework for interpretable and precise counterfactual editing of individual cells. scCBGM adapts concept bottleneck architectures for single-cell data through decoder skip connections and a cross-covariance penalty that promotes disentanglement without dimensional constraints. We extend the framework to flow matching models, enabling concept-guided editing in both encoding-decoding and generation regimes. To enable rigorous evaluation, we develop a synthetic benchmark with ground-truth counterfactuals. Across multiple real datasets, scCBGM demonstrates superior performance in combinatorial generalization and counterfactual prediction, supported by cell-level validation on synthetic data and population-level benchmarks on real datasets.
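To make the cross-covariance penalty mentioned above more concrete, the following is a minimal NumPy sketch of one common way such a penalty can be formulated: the squared Frobenius norm of the batch cross-covariance between a concept block and a residual (unsupervised) latent block. The function name `cross_covariance_penalty`, the block names, the dimensions, and the toy data are all assumptions made for illustration; the abstract does not specify the exact form used in scCBGM.

```python
import numpy as np


def cross_covariance_penalty(concepts, residual):
    """Squared Frobenius norm of the batch cross-covariance.

    concepts: array of shape (n_cells, n_concepts)  -- hypothetical concept block
    residual: array of shape (n_cells, n_residual)  -- hypothetical free latent block
    """
    # Center each block over the batch dimension.
    c = concepts - concepts.mean(axis=0, keepdims=True)
    r = residual - residual.mean(axis=0, keepdims=True)
    n_cells = concepts.shape[0]
    # Cross-covariance matrix of shape (n_concepts, n_residual).
    cov = c.T @ r / n_cells
    # Sum of squared entries; zero when the blocks are linearly uncorrelated.
    return float(np.sum(cov ** 2))


# Toy data (illustrative only): one residual block that is independent of the
# concepts and one that is a noisy linear function of them.
rng = np.random.default_rng(0)
concepts = rng.normal(size=(512, 4))
independent = rng.normal(size=(512, 8))
mixing = rng.normal(size=(4, 8))
entangled = concepts @ mixing + 0.1 * rng.normal(size=(512, 8))

penalty_independent = cross_covariance_penalty(concepts, independent)
penalty_entangled = cross_covariance_penalty(concepts, entangled)
print(penalty_independent, penalty_entangled)
```

In this sketch the penalty is much larger for the entangled block than for the independent one, so minimizing it alongside a reconstruction loss would push the residual latent to carry information that is linearly uncorrelated with the concepts. The two blocks may have different widths, since the cross-covariance matrix is rectangular.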