
Improving Survival Models in Healthcare by Balancing Imbalanced Cohorts: A Novel Approach

Published: October 2, 2025 | arXiv ID: 2510.02137v1

By: Catherine Ning, Dimitris Bertsimas, Johan Gagnière and more

BigTech Affiliations: Massachusetts Institute of Technology

Potential Business Impact:

Improves doctor predictions for rare patient groups.

Business Areas:

A/B Testing, Data and Analytics

We explore whether survival model performance in underrepresented high- and low-risk subgroups, the regions of the prognostic spectrum where clinical decisions are most consequential, can be improved through targeted restructuring of the training dataset. Rather than modifying model architecture, we propose a risk-stratified sampling method that partitions patients by baseline prognostic risk and applies matching within each stratum. This equalizes representation across the risk distribution and supports more reliable learning in underrepresented tail strata. We implement this framework on a cohort of 1,799 patients with resected colorectal liver metastases (CRLM), including 1,197 who received adjuvant chemotherapy and 602 who did not. All models used in this study are Cox proportional hazards models trained on the same set of selected variables. Model performance is assessed via Harrell's C-index, time-dependent AUC, and the Integrated Calibration Index (ICI), with internal validation using Efron's bias-corrected bootstrapping. External validation is conducted on two independent CRLM datasets. Compared with models trained on the full dataset, Cox models trained on risk-balanced cohorts showed consistent improvements in internal validation and noticeably higher stratified C-index values in the underrepresented high- and low-risk strata of the external cohorts. Our findings suggest that survival model performance in observational oncology cohorts can be meaningfully improved through targeted rebalancing of the training data across prognostic risk strata. This approach offers a practical and model-agnostic complement to existing methods, especially in applications where predictive reliability across the full risk continuum is critical to downstream clinical decisions.
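To make the rebalancing idea concrete, here is a minimal Python sketch. It is not the authors' procedure: the abstract does not specify how strata are defined or how matching is done, so the sketch assumes equal-width bins over a baseline risk score and swaps in plain resampling to a common stratum size in place of matching. The function names (`assign_strata`, `risk_balanced_indices`, `harrell_c`) and the synthetic risk scores are hypothetical. A plain Harrell's C-index is included because the abstract reports it per stratum.

```python
import random
from collections import Counter, defaultdict
from statistics import median

def assign_strata(risk_scores, n_strata=5):
    """Label each patient with an equal-width risk bin (0 = lowest risk)."""
    lo, hi = min(risk_scores), max(risk_scores)
    width = (hi - lo) / n_strata or 1.0  # fall back if all scores are equal
    return [min(int((r - lo) / width), n_strata - 1) for r in risk_scores]

def risk_balanced_indices(risk_scores, n_strata=5, per_stratum=None, seed=0):
    """Resample row indices so every non-empty stratum contributes equally.

    Assumption: simple resampling stands in for the within-stratum
    matching described in the abstract.
    """
    rng = random.Random(seed)
    strata = assign_strata(risk_scores, n_strata)
    groups = defaultdict(list)
    for i, s in enumerate(strata):
        groups[s].append(i)
    if per_stratum is None:
        per_stratum = int(median(len(g) for g in groups.values()))
    picked = []
    for s in sorted(groups):
        g = groups[s]
        if len(g) >= per_stratum:
            picked += rng.sample(g, per_stratum)     # downsample dense strata
        else:
            picked += rng.choices(g, k=per_stratum)  # upsample sparse strata
    return picked, strata

def harrell_c(times, events, risk_scores):
    """Harrell's C: share of comparable pairs where the earlier event has higher risk.

    Ties in risk score count as 0.5; tied event times are not treated as comparable.
    """
    concordant, comparable = 0.0, 0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # a censored subject cannot anchor a comparable pair
        for j, t_j in enumerate(times):
            if t_j > t_i:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Synthetic baseline risk scores: dense in the middle, sparse in the tails.
demo_rng = random.Random(42)
risk = [demo_rng.gauss(0, 1) for _ in range(1000)]
idx, strata = risk_balanced_indices(risk)
before = Counter(strata)                 # imbalanced stratum sizes
after = Counter(strata[i] for i in idx)  # equal stratum sizes
```

In a full pipeline, the rows selected by `idx` would form the training set for a Cox model, and `harrell_c` would be computed separately within each stratum of a held-out cohort. Upsampling with replacement duplicates patients in sparse strata, which is one reason a matching-based scheme like the one the abstract describes may behave differently from this sketch.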

Unsupervised risk factor identification across cancer types and data modalities via explainable artificial intelligence

Machine Learning (CS)

Finds sick people who will get better or worse.

15 Jun 2025

87%

Comprehensive Benchmarking of Machine Learning Methods for Risk Prediction Modelling from Large-Scale Survival Data: A UK Biobank Study

Machine Learning (CS)

Finds best computer models to predict health risks.

11 Mar 2025

87%

Lung Cancer Survival Prediction Using Machine Learning and Statistical Methods

Applications

Predicts lung cancer survival better for patients.

29 Sep 2025


Country of Origin

🇺🇸 United States

Page Count

14 pages
