The cell as a token: high-dimensional geometry in language models and cell embeddings

📅 2025-03-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This perspective addresses the limited interpretability and poor cross-batch robustness of high-dimensional embeddings of single-cell sequencing data. It draws a correspondence between the geometric structure of large language model (LLM) token embeddings and the single-cell expression space, proposing a "geometry-driven cell embedding" viewpoint. Methodologically, it treats cells as tokens and surveys LLM embedding analysis, interpretability probing, in-context reasoning, and manifold learning as tools for modeling the low-dimensional geometric structure of cellular state space. Its contributions are threefold: (1) articulating the role of manifold geometry in embedding robustness and interpretability; (2) outlining a unified framework connecting LLM interpretability techniques with single-cell atlas construction; and (3) identifying directions for improving generalizability and biological interpretability in cross-batch integration, cell-type annotation, and functional-state inference.

📝 Abstract
Single-cell sequencing technology maps cells to a high-dimensional space encoding their internal activity. This process mirrors parallel developments in machine learning, where large language models ingest unstructured text by converting words into discrete tokens embedded within a high-dimensional vector space. This perspective explores how advances in understanding the structure of language embeddings can inform ongoing efforts to analyze and visualize single-cell datasets. We discuss how the context of tokens influences the geometry of embedding space, and the role of low-dimensional manifolds in shaping this space's robustness and interpretability. We highlight new developments in language modeling, such as interpretability probes and in-context reasoning, that can inform future efforts to construct and consolidate cell atlases.
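The interpretability probes mentioned above can be made concrete with a small sketch. This is our own toy illustration, not code from the paper: in the LLM-probing sense, a "probe" is a simple classifier (here, logistic regression) trained on frozen embeddings; high held-out accuracy indicates the label of interest is linearly decodable from the embedding geometry. We stand in for real cell embeddings with synthetic Gaussian clusters.

```python
# Hedged sketch (not the paper's method): a linear interpretability probe
# applied to cell-like embeddings, asking whether a biological label
# (here, a synthetic cell-type assignment) is linearly decodable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for cell embeddings: 3 "cell types", each a Gaussian
# cluster in a 64-dimensional embedding space.
n_per_type, dim = 200, 64
centers = rng.normal(scale=3.0, size=(3, dim))
X = np.vstack([c + rng.normal(size=(n_per_type, dim)) for c in centers])
y = np.repeat([0, 1, 2], n_per_type)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# High held-out accuracy means the label is linearly readable from the
# embedding -- the same criterion used when probing LLM representations.
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
```

The same recipe transfers directly to real single-cell embeddings (e.g., from a trained foundation model), with cell-type or batch labels as the probed quantity; probing for *batch* rather than cell type is one way to quantify the cross-batch robustness problem the abstract raises.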
Problem

Research questions and friction points this paper is trying to address.

Understanding high-dimensional cell embeddings via language-model techniques
Exploring how token context shapes embedding-space geometry
Applying language-model advances to improve cell atlas construction
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-dimensional cell embeddings parallel LLM token embeddings
Token context shapes embedding-space geometry
Interpretability probes can inform cell atlas construction
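The claim that embedding spaces concentrate on low-dimensional manifolds, which the abstract ties to robustness and interpretability, can be checked with a standard diagnostic. The sketch below is our own toy example, not the paper's analysis: PCA on embeddings that live near a 5-dimensional subspace of a 64-dimensional ambient space recovers a small effective dimensionality.

```python
# Hedged illustration: estimate effective dimensionality of an embedding
# cloud via PCA. If a few components capture most variance, the geometry
# is much lower-dimensional than the ambient embedding space.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Embeddings lying near a 5-dim subspace of a 64-dim ambient space,
# plus small isotropic noise (a crude stand-in for a cell-state manifold).
n, ambient, intrinsic = 500, 64, 5
basis = rng.normal(size=(intrinsic, ambient))
X = rng.normal(size=(n, intrinsic)) @ basis + 0.05 * rng.normal(size=(n, ambient))

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
k95 = int(np.searchsorted(cum, 0.95)) + 1  # components for 95% variance
print(f"components for 95% variance: {k95} of {ambient}")
```

On real cell embeddings the spectrum decays more gradually, but the same cumulative-variance curve is a quick first look at whether a nonlinear manifold method (UMAP, diffusion maps) is likely to find meaningful low-dimensional structure.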