Open World Knowledge Aided Single-Cell Foundation Model with Robust Cross-Modal Cell-Language Pre-training
By: Haoran Wang, Xuanyi Zhang, Shuangsang Fang and more
Potential Business Impact:
Helps computers understand cells by reading their descriptions.
Recent advancements in single-cell multi-omics, particularly RNA-seq, have provided profound insights into cellular heterogeneity and gene regulation. Single-cell foundation models built on the pre-trained language model (PLM) paradigm have shown promise, but they remain constrained in two ways: they integrate in-depth individual profiles insufficiently, and they neglect the influence of noise within multi-modal data. To address both issues, we propose an Open-world Language Knowledge-Aided Robust Single-Cell Foundation Model (OKR-CELL). It is built on a cross-modal Cell-Language pre-training framework with two key innovations: (1) a Large Language Model (LLM) based workflow with retrieval-augmented generation (RAG) that enriches cell textual descriptions using open-world knowledge; and (2) a Cross-modal Robust Alignment (CRA) objective that incorporates sample reliability assessment, curriculum learning, and coupled momentum contrastive learning to strengthen the model's resistance to noisy data. After pre-training on 32M cell-text pairs, OKR-CELL achieves cutting-edge results across 6 evaluation tasks. Beyond standard benchmarks such as cell clustering, cell-type annotation, batch-effect correction, and few-shot annotation, the model also demonstrates superior performance in broader multi-modal applications, including zero-shot cell-type annotation and bidirectional cell-text retrieval.
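To make the idea of reliability-aware cross-modal alignment concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over cell and text embeddings in which each pair is weighted by a reliability score. This is not the authors' CRA objective: the function name `weighted_contrastive_loss`, the `reliability` weights, and the temperature of 0.07 are illustrative assumptions, and curriculum learning and momentum encoders are left out entirely.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax_diag(sim, i):
    """Log-probability that row i selects its own partner (column i)."""
    row = sim[i]
    m = max(row)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in row))
    return row[i] - log_z


def weighted_contrastive_loss(cell_emb, text_emb, reliability, temperature=0.07):
    """Symmetric InfoNCE loss with per-pair reliability weights.

    cell_emb, text_emb: lists of equal-length vectors; pair i is (cell i, text i).
    reliability: hypothetical per-pair weights in [0, 1]; a lower weight means
    the pair is trusted less and contributes less to the loss.
    """
    cells = [_normalize(c) for c in cell_emb]
    texts = [_normalize(t) for t in text_emb]
    n = len(cells)

    # Cell-to-text similarity matrix, scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(c, t)) / temperature for t in texts]
           for c in cells]
    # Text-to-cell similarities are the transpose.
    sim_t = [[sim[i][j] for i in range(n)] for j in range(n)]

    loss = 0.0
    for i in range(n):
        pair_loss = -0.5 * (_log_softmax_diag(sim, i) + _log_softmax_diag(sim_t, i))
        loss += reliability[i] * pair_loss
    return loss / sum(reliability)


# Toy batch: the third text is a mismatched (noisy) description.
cells = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
uniform = weighted_contrastive_loss(cells, texts, [1.0, 1.0, 1.0])
downweighted = weighted_contrastive_loss(cells, texts, [1.0, 1.0, 0.1])
```

In this toy batch, lowering the weight of the mismatched third pair reduces its pull on the loss, which is the general intuition behind assessing sample reliability before alignment. How reliability is actually estimated in OKR-CELL is not specified in the abstract.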
Similar Papers
Language-Enhanced Representation Learning for Single-Cell Transcriptomics
Machine Learning (CS)
Helps understand cells by combining gene data and text.
CellVerse: Do Large Language Models Really Understand Cell Biology?
Quantitative Methods
Helps computers understand cell biology like a language.
Cell2Text: Multimodal LLM for Generating Single-Cell Descriptions from RNA-Seq Data
Machine Learning (CS)
Explains what cells are doing in plain English.