🤖 AI Summary
This study addresses a limitation of transcriptomic representations: they are interpretable at the gene level but carry little morphological information and generalize poorly, and the weakly paired multimodal data (transcriptomics + microscopy images) that could enrich them are scarce. To this end, we propose Semi-Clipped, a cross-modal knowledge distillation framework, together with Perturbation Embedding Augmentation (PEA). Without requiring strongly aligned labels, Semi-Clipped leverages a pretrained Vision Transformer (ViT), contrastive learning, and multimodal alignment losses to distill morphological knowledge from microscopy images into gene expression embeddings. To our knowledge, this is the first work enabling morphology–transcriptome joint representation learning under weak supervision. Evaluated on multiple cell response prediction tasks, including drug response and perturbation effect estimation, Semi-Clipped achieves state-of-the-art performance, demonstrating superior generalization and robustness to input perturbations while retaining gene-level interpretability.
📝 Abstract
Understanding cellular responses to stimuli is crucial for biological discovery and drug development. Transcriptomics provides interpretable, gene-level insights, while microscopy imaging offers rich predictive features but is harder to interpret. Weakly paired datasets, where samples share biological states, enable multimodal learning but are scarce, limiting their utility for training and multimodal inference. We propose a framework to enhance transcriptomics by distilling knowledge from microscopy images. Using weakly paired data, our method aligns and binds modalities, enriching gene expression representations with morphological information. To address data scarcity, we introduce (1) Semi-Clipped, an adaptation of CLIP for cross-modal distillation using pretrained foundation models, achieving state-of-the-art results, and (2) PEA (Perturbation Embedding Augmentation), a novel augmentation technique that enhances transcriptomics data while preserving inherent biological information. These strategies improve the predictive power and retain the interpretability of transcriptomics, enabling rich unimodal representations for complex biological tasks.
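To make the idea of CLIP-style cross-modal distillation concrete, the following is a minimal sketch of the generic symmetric contrastive (InfoNCE) objective that CLIP uses, applied to transcriptomics and image embeddings. It is not the paper's Semi-Clipped loss, whose exact form is not given here. The function names, the temperature value, and the assumption that row `i` of each matrix comes from the same perturbation (the weak pairing) are all illustrative. In a distillation setting the image embeddings would come from a frozen pretrained model and only the transcriptomics encoder would be trained; this sketch only computes the loss value.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_style_loss(tx_emb, img_emb, temperature=0.07):
    """Symmetric contrastive loss between two batches of embeddings.

    tx_emb:  (batch, dim) transcriptomics embeddings (hypothetical student output).
    img_emb: (batch, dim) image embeddings (hypothetical frozen teacher output).
    Assumption: row i of tx_emb and row i of img_emb share a biological state.
    """
    tx = l2_normalize(tx_emb)
    img = l2_normalize(img_emb)
    logits = tx @ img.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Cross-entropy with the matching index as the target, in both directions.
    loss_tx_to_img = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_img_to_tx = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_tx_to_img + loss_img_to_tx)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tx = rng.normal(size=(8, 16))
    aligned = clip_style_loss(tx, tx)                       # perfectly matched pairs
    shuffled = clip_style_loss(tx, np.roll(tx, 1, axis=0))  # mismatched pairs
    print(f"aligned: {aligned:.4f}  shuffled: {shuffled:.4f}")
```

The loss is lower when matched rows point in the same direction than when the pairing is shuffled, which is the signal that pulls transcriptomics embeddings toward the morphological structure encoded by the image model.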