Score: 2

PANDA-PLUS-Bench: A Clinical Benchmark for Evaluating Robustness of AI Foundation Models in Prostate Cancer Diagnosis

Published: December 16, 2025 | arXiv ID: 2512.14922v1

By: Joshua L. Ebbert, Dennis Della Corte

Potential Business Impact:

Tests AI to find real cancer, not fake clues.

Business Areas:
Image Recognition Data and Analytics, Software

Artificial intelligence foundation models are increasingly deployed for prostate cancer Gleason grading, where GP3/GP4 distinction directly impacts treatment decisions. However, these models may achieve high validation accuracy by learning specimen-specific artifacts rather than generalizable biological features, limiting real-world clinical utility. We introduce PANDA-PLUS-Bench, a curated benchmark dataset derived from expert-annotated prostate biopsies designed specifically to quantify this failure mode. The benchmark comprises nine carefully selected whole slide images from nine unique patients containing diverse Gleason patterns, with non-overlapping tissue patches extracted at both 512x512 and 224x224 pixel resolutions across eight augmentation conditions. Using this benchmark, we evaluate seven foundation models on their ability to separate biological signal from slide-level confounders. Our results reveal substantial variation in robustness across models: Virchow2 achieved the lowest slide-level encoding among large-scale models (81.0%) yet exhibited the second-lowest cross-slide accuracy (47.2%). HistoEncoder, trained specifically on prostate tissue, demonstrated the highest cross-slide accuracy (59.7%) and the strongest slide-level encoding (90.3%), suggesting tissue-specific training may enhance both biological feature capture and slide-specific signatures. All models exhibited measurable within-slide vs. cross-slide accuracy gaps, though the magnitude varied from 19.9 percentage points to 26.9 percentage points. We provide an open-source Google Colab notebook enabling researchers to evaluate additional foundation models against our benchmark using standardized metrics. PANDA-PLUS-Bench addresses a critical gap in foundation model evaluation by providing a purpose-built resource for robustness assessment in the clinically important context of Gleason grading.

Country of Origin
🇺🇸 United States


Page Count
21 pages

Category
Computer Science:
CV and Pattern Recognition