🤖 AI Summary
To bridge the substantial semantic gap between Cell Painting images and heterogeneous perturbations (e.g., small molecules, CRISPR knockouts), and to unify their representations, this paper proposes CellCLIP, the first cross-modal contrastive learning framework tailored to cellular phenotypic analysis. It introduces a microscopy-channel encoding scheme that explicitly models fluorescence-channel-specific information, and couples pre-trained ViT/ResNet image encoders with BERT-style text encoders in a CLIP-inspired paradigm to align perturbation descriptions with morphological phenotypes in a shared embedding space. The method substantially improves cross-modal retrieval accuracy, outperforms existing open-source models on downstream tasks such as perturbation clustering and mechanism-of-action inference, and achieves a 3.2× inference speedup, thereby overcoming the transfer bottleneck that natural-image pre-trained models face in cellular imaging.
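The summary gives no implementation details, so the sketch below shows only the standard CLIP-style symmetric contrastive objective that such a framework would optimize over a batch of matched (image, perturbation-text) embedding pairs. All names are illustrative, and the temperature value is an assumption rather than a figure from the paper.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor,
                    text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings."""
    # L2-normalize so the dot product below is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity logits; matched pairs sit on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both retrieval directions (image->text and text->image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Under this objective, each perturbation description acts as the positive for its own image embedding and as a negative for every other image in the batch, which is what pulls the two modalities into a shared space.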
📝 Abstract
High-content screening (HCS) assays based on high-throughput microscopy techniques such as Cell Painting have enabled the interrogation of cells' morphological responses to perturbations at an unprecedented scale. The collection of such data promises to facilitate a better understanding of the relationships between different perturbations and their effects on cellular state. Towards achieving this goal, recent advances in cross-modal contrastive learning could, in theory, be leveraged to learn a unified latent space that aligns perturbations with their corresponding morphological effects. However, the application of such methods to HCS data is not straightforward due to substantial differences in the semantics of Cell Painting images compared to natural images, and the difficulty of representing different classes of perturbations (e.g., small molecule vs CRISPR gene knockout) in a single latent space. In response to these challenges, here we introduce CellCLIP, a cross-modal contrastive learning framework for HCS data. CellCLIP leverages pre-trained image encoders coupled with a novel channel encoding scheme to better capture relationships between different microscopy channels in image embeddings, along with natural language encoders for representing perturbations. Our framework outperforms current open-source models, demonstrating the best performance in both cross-modal retrieval and biologically meaningful downstream tasks while also achieving significant reductions in computation time.
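As a rough illustration of what a channel encoding scheme on top of a frozen pre-trained backbone might look like, here is a minimal PyTorch sketch that encodes each fluorescence channel separately, tags each plane with a learned channel embedding, and pools across channels. This is one plausible reading of the abstract, not CellCLIP's actual architecture; `ChannelAwareImageEncoder` and all of its parameters are hypothetical.

```python
import torch
import torch.nn as nn

class ChannelAwareImageEncoder(nn.Module):
    """Hypothetical sketch: embed each Cell Painting channel with a shared
    frozen backbone, add a learned per-channel embedding, then mean-pool."""

    def __init__(self, backbone: nn.Module, emb_dim: int, n_channels: int = 5):
        super().__init__()
        self.backbone = backbone  # frozen pre-trained encoder, e.g. a ViT
        self.channel_emb = nn.Embedding(n_channels, emb_dim)
        for p in self.backbone.parameters():
            p.requires_grad = False  # keep pre-trained weights frozen

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_channels, H, W), one grayscale plane per stain.
        b, c, h, w = x.shape
        # Replicate each plane to 3 channels so an RGB backbone accepts it.
        planes = x.reshape(b * c, 1, h, w).repeat(1, 3, 1, 1)
        feats = self.backbone(planes).reshape(b, c, -1)  # (b, c, emb_dim)
        # Tell the model which fluorescence channel each embedding came from.
        feats = feats + self.channel_emb.weight.unsqueeze(0)
        # Pool channels into a single image-level embedding for contrastive training.
        return feats.mean(dim=1)
```

The design choice this sketch highlights is the one the abstract emphasizes: rather than treating a five-channel Cell Painting stack as a single pseudo-RGB image, the channels are kept distinct so their identities and relationships can be represented explicitly in the image embedding.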