Score: 1

MS-ConTab: Multi-Scale Contrastive Learning of Mutation Signatures for Pan Cancer Representation and Stratification

Published: August 26, 2025 | arXiv ID: 2508.19424v1

By: Yifan Dou , Adam Khadre , Ruben C Petreaca and more

Potential Business Impact:

Groups cancers by how they change.

Business Areas:
Bioinformatics Biotechnology, Data and Analytics, Science and Engineering

Motivation. Understanding the pan-cancer mutational landscape offers critical insights into the molecular mechanisms underlying tumorigenesis. While patient-level machine learning techniques have been widely employed to identify tumor subtypes, cohort-level clustering, where entire cancer types are grouped based on shared molecular features, has largely relied on classical statistical methods. Results. In this study, we introduce a novel unsupervised contrastive learning framework to cluster 43 cancer types based on coding mutation data derived from the COSMIC database. For each cancer type, we construct two complementary mutation signatures: a gene-level profile capturing nucleotide substitution patterns across the most frequently mutated genes, and a chromosome-level profile representing normalized substitution frequencies across chromosomes. These dual views are encoded using TabNet encoders and optimized via a multi-scale contrastive learning objective (NT-Xent loss) to learn unified cancer-type embeddings. We demonstrate that the resulting latent representations yield biologically meaningful clusters of cancer types, aligning with known mutational processes and tissue origins. Our work represents the first application of contrastive learning to cohort-level cancer clustering, offering a scalable and interpretable framework for mutation-driven cancer subtyping.

Country of Origin
🇺🇸 United States

Repos / Data Links

Page Count
8 pages

Category
Computer Science:
Machine Learning (CS)