Clustering by Denoising: Latent plug-and-play diffusion for single-cell data
By: Dominik Meier, Shixing Yu, Sagnik Nandy and more
Potential Business Impact:
Helps scientists sort cells better.
Single-cell RNA sequencing (scRNA-seq) enables the study of cellular heterogeneity. Yet accurate clustering, and with it any downstream analysis based on cell labels, remains challenging because of measurement noise and biological variability. In standard latent spaces (e.g., obtained through PCA), data from different cell types can be projected close together, making accurate clustering difficult. We introduce a latent plug-and-play diffusion framework that separates the observation space from the denoising space. This separation is operationalized through a novel Gibbs sampling procedure: the learned diffusion prior is applied in a low-dimensional latent space to perform denoising, while noise is reintroduced in the original high-dimensional observation space to steer the process. This "input-space steering" keeps the denoising trajectory faithful to the original data structure. Our approach offers three key advantages: (1) adaptive noise handling via a tunable balance between the prior and the observed data; (2) principled uncertainty estimates for downstream analysis; and (3) generalizable denoising that leverages clean reference data to denoise noisier datasets and, via averaging, improves quality beyond the training set. We evaluate robustness on both synthetic and real single-cell genomics data. Our method improves clustering accuracy on synthetic data across varied noise levels and dataset shifts. On real-world single-cell data, it yields more biologically coherent cell clusters, with boundaries that better align with known cell type markers and developmental trajectories.
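The abstract does not spell out the sampler, so the following is only a minimal sketch of a generic split-Gibbs scheme in the same spirit, not the authors' implementation. It alternates between drawing a sample in observation space (where noise is reintroduced and the data pull the sample back toward the measurements) and denoising in latent space. Everything named here is an assumption made for illustration: a linear PCA encoder/decoder with orthonormal columns, Gaussian observation noise `sigma`, a coupling noise level `rho` that trades off prior against data, and a Gaussian-mixture prior with spread `tau` standing in for the learned diffusion prior.

```python
import numpy as np

def fit_pca(X, k):
    """Return the mean and the top-k principal directions (as columns) of X."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k].T  # W has shape (d, k) with orthonormal columns

def gmm_posterior_sample(z_noisy, centroids, tau, rho, rng):
    """Stand-in for the learned diffusion prior (assumption): exact posterior
    sample of a clean latent given z_noisy = z + N(0, rho^2 I), under an
    equal-weight isotropic Gaussian mixture prior with spread tau."""
    d2 = ((z_noisy[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    logits = -0.5 * d2 / (tau**2 + rho**2)
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Draw one mixture component per cell by inverting the cumulative weights.
    u = rng.random(len(z_noisy))
    comp = (probs.cumsum(axis=1) < u[:, None]).sum(axis=1)
    comp = np.minimum(comp, len(centroids) - 1)
    prec = 1.0 / rho**2 + 1.0 / tau**2
    mean = (z_noisy / rho**2 + centroids[comp] / tau**2) / prec
    return mean + rng.standard_normal(z_noisy.shape) / np.sqrt(prec)

def latent_split_gibbs(Y, mu, W, centroids, sigma, rho, tau, n_iter, rng):
    """Alternate an observation-space step and a latent denoising step.
    Returns the per-cell posterior mean and standard deviation of the latent."""
    Z = (Y - mu) @ W                      # initialise from the raw projection
    burn = n_iter // 2
    z_sum = np.zeros_like(Z)
    z_sq_sum = np.zeros_like(Z)
    for t in range(n_iter):
        # Observation-space step: combine the data Y with the decoded latent,
        # then reintroduce Gaussian noise in the high-dimensional space.
        prec = 1.0 / sigma**2 + 1.0 / rho**2
        mean = (Y / sigma**2 + (mu + Z @ W.T) / rho**2) / prec
        X = mean + rng.standard_normal(Y.shape) / np.sqrt(prec)
        # Latent step: encode the noisy sample and denoise it with the prior.
        Z = gmm_posterior_sample((X - mu) @ W, centroids, tau, rho, rng)
        if t >= burn:
            z_sum += Z
            z_sq_sum += Z**2
    n_kept = n_iter - burn
    z_mean = z_sum / n_kept
    z_std = np.sqrt(np.maximum(z_sq_sum / n_kept - z_mean**2, 0.0))
    return z_mean, z_std

# Toy usage on synthetic data (three hypothetical "cell types" in 20 dimensions).
rng = np.random.default_rng(0)
centers = 4.0 * np.eye(3, 20)
labels = rng.integers(0, 3, size=300)
X_ref = centers[labels] + 0.1 * rng.standard_normal((300, 20))  # clean reference
Y = centers[labels] + 1.0 * rng.standard_normal((300, 20))      # noisy observations
mu, W = fit_pca(X_ref, k=2)
centroids = np.stack(
    [((X_ref[labels == j] - mu) @ W).mean(axis=0) for j in range(3)]
)
Z_mean, Z_std = latent_split_gibbs(
    Y, mu, W, centroids, sigma=1.0, rho=0.5, tau=0.2, n_iter=200, rng=rng
)
```

In this sketch, a smaller `rho` ties the observation-space sample more tightly to the decoded latent (more weight on the prior), while a larger `rho` lets the measurements dominate. Averaging the retained latent samples gives the denoised embedding `Z_mean`, and `Z_std` is a simple per-cell uncertainty estimate that could be passed to a clustering step.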
Similar Papers
Scalable Single-Cell Gene Expression Generation with Latent Diffusion Models
Machine Learning (Stat)
Creates realistic cell data for science research.
scDD: Latent Codes Based scRNA-seq Dataset Distillation with Foundation Model Knowledge
Machine Learning (CS)
Makes cell data smaller, easier to share and use.
Clustering with Communication: A Variational Framework for Single Cell Representation Learning
Machine Learning (CS)
Groups cells while accounting for how they communicate.