🤖 AI Summary
Single-cell RNA sequencing (scRNA-seq) data pose significant challenges, including high dimensionality, extreme sparsity, strong batch effects, class imbalance, and rapidly increasing scale, that hinder cross-center knowledge transfer and integration. To address these, we propose scDD, a latent-code-driven dataset distillation framework. scDD distills foundation model knowledge and original scRNA-seq dataset information into a compact latent space, from which a generator produces a synthetic dataset that replaces the original. It introduces SCDG, a single-step conditional diffusion generator that avoids the gradient decay caused by multi-step back-propagation while preserving scRNA-seq data characteristics and inter-class discriminability. Furthermore, we establish a comprehensive benchmark for evaluating scRNA-seq dataset distillation across different data analysis tasks. Experiments show that scDD achieves a 7.61% absolute and 15.70% relative improvement over previous state-of-the-art methods, averaged across tasks.
📝 Abstract
Single-cell RNA sequencing (scRNA-seq) technology has profiled hundreds of millions of human cells across organs, diseases, development, and perturbations to date. However, the high dimensionality and sparsity, batch-effect noise, category imbalance, and ever-increasing scale of the original sequencing data pose significant challenges for multi-center knowledge transfer, data fusion, and cross-validation between scRNA-seq datasets. To address these barriers, (1) we first propose a latent-code-based scRNA-seq dataset distillation framework named scDD, which transfers and distills foundation model knowledge and original dataset information into a compact latent space and uses a generator to produce a synthetic scRNA-seq dataset that replaces the original dataset. Then, (2) we propose a single-step conditional diffusion generator named SCDG, which performs single-step gradient back-propagation to help scDD optimize distillation quality and avoid the gradient decay caused by multi-step back-propagation. Meanwhile, SCDG preserves the scRNA-seq data characteristics and inter-class discriminability of the synthetic dataset through flexible conditional control and generation quality assurance. Finally, we propose a comprehensive benchmark to evaluate the performance of scRNA-seq dataset distillation on different data analysis tasks. Experiments show that our method achieves a 7.61% absolute and 15.70% relative improvement over previous state-of-the-art methods, averaged across tasks.
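The abstract reports both an absolute and a relative improvement, and the two figures measure different things. The sketch below shows one plausible way such averages could be computed. The task names and scores are hypothetical and are not taken from the paper, and the helper functions `absolute_gain` and `relative_gain` are illustrative. The paper's exact averaging procedure is not stated in the abstract, so per-task averaging is an assumption here.

```python
def absolute_gain(baseline, method):
    """Difference in percentage points between two scores given in percent."""
    return method - baseline


def relative_gain(baseline, method):
    """Improvement expressed as a percentage of the baseline score."""
    return 100.0 * (method - baseline) / baseline


# Hypothetical per-task scores in percent: (baseline, method).
# These values are made up for illustration only.
tasks = {
    "task_a": (40.0, 50.0),
    "task_b": (80.0, 85.0),
}

abs_gains = [absolute_gain(b, m) for b, m in tasks.values()]
rel_gains = [relative_gain(b, m) for b, m in tasks.values()]

# Assumption: gains are computed per task first, then averaged.
mean_abs = sum(abs_gains) / len(abs_gains)
mean_rel = sum(rel_gains) / len(rel_gains)

# Alternative convention: average the scores first, then take one ratio.
mean_baseline = sum(b for b, _ in tasks.values()) / len(tasks)
mean_method = sum(m for _, m in tasks.values()) / len(tasks)
rel_of_means = relative_gain(mean_baseline, mean_method)

print(f"mean absolute gain: {mean_abs:.2f} points")
print(f"mean relative gain: {mean_rel:.2f}%")
print(f"relative gain of mean scores: {rel_of_means:.2f}%")
```

In this toy case the mean of the per-task relative gains differs from the relative gain of the mean scores. For that reason, a baseline score cannot in general be recovered from an averaged absolute figure and an averaged relative figure alone.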