
Targeted learning via probabilistic subpopulation matching

Published: December 26, 2025 | arXiv ID: 2512.21840v1

By: Xiaokang Liu, Jie Hu, Naimin Jing and more

In biomedical research, leveraging information from multiple similar source studies has proved useful for obtaining more accurate predictions in a target study. However, in many biomedical applications based on real-world data, the populations under consideration in different studies (e.g., at different clinical sites) can be heterogeneous, which makes it challenging to borrow information properly for the target study. State-of-the-art methods typically rely on study-level matching to identify source studies that are similar to the target study; samples from source studies that differ significantly from the target study are all dropped at the study level, which can lead to a substantial loss of information. We consider a general situation in which all studies are sampled from a super-population composed of distinct subpopulations, and we propose a novel framework of targeted learning via subpopulation matching. In contrast to existing study-level matching methods, measuring similarities between subpopulations can effectively decompose both within- and between-study heterogeneity, which allows information from all source studies to be incorporated without dropping any samples. We formulate the proposed framework as a two-step procedure: a finite mixture model is first fitted jointly across all studies to provide subject-wise probabilistic subpopulation information, and information is then transferred within each identified subpopulation from the source studies to the target study. We establish the non-asymptotic properties of our estimator and demonstrate, via simulation studies, that our method can improve prediction at the target study.
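The two-step idea in the abstract can be illustrated with a small sketch. This is not the authors' estimator: it assumes a spherical Gaussian mixture on the covariates for step one and swaps in plain responsibility-weighted least squares, pooled over all studies, for the within-subpopulation transfer in step two. All function names (`responsibilities`, `fit_gmm`, `fit_subpop_regressions`, `predict`), the simulated data, and the target-only baseline are hypothetical choices made for illustration.

```python
import numpy as np

def responsibilities(X, weights, means, variances):
    """Posterior subpopulation probabilities for each subject (rows sum to 1)."""
    d = X.shape[1]
    sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
    log_p = np.log(weights) - 0.5 * d * np.log(2 * np.pi * variances) - 0.5 * sq / variances
    log_p -= log_p.max(axis=1, keepdims=True)  # stabilise before exponentiating
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def fit_gmm(X, k, n_iter=50):
    """Step 1 (assumed form): EM for a spherical Gaussian mixture on pooled covariates."""
    n, d = X.shape
    means = [X[0]]  # deterministic farthest-point initialisation
    for _ in range(1, k):
        d2 = np.min([((X - m) ** 2).sum(axis=1) for m in means], axis=0)
        means.append(X[np.argmax(d2)])
    means = np.array(means)
    variances = np.full(k, X.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        resp = responsibilities(X, weights, means, variances)      # E-step
        nk = resp.sum(axis=0) + 1e-12                              # M-step
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        variances = (resp * sq).sum(axis=0) / (nk * d) + 1e-6
    return weights, means, variances

def fit_subpop_regressions(X, y, resp, ridge=1e-6):
    """Step 2 (stand-in): one weighted least-squares fit per subpopulation."""
    Xd = np.column_stack([np.ones(len(X)), X])
    coefs = []
    for j in range(resp.shape[1]):
        w = resp[:, j]
        A = Xd.T @ (w[:, None] * Xd) + ridge * np.eye(Xd.shape[1])
        coefs.append(np.linalg.solve(A, Xd.T @ (w * y)))
    return np.array(coefs)  # shape (k, d + 1)

def predict(X_new, gmm, coefs):
    """Mix subpopulation-specific predictions by membership probability."""
    resp_new = responsibilities(X_new, *gmm)
    Xd = np.column_stack([np.ones(len(X_new)), X_new])
    return (resp_new * (Xd @ coefs.T)).sum(axis=1)

# Simulated example: two subpopulations, three studies with different mixes.
rng = np.random.default_rng(0)
centers = np.array([[-3.0, 0.0], [3.0, 0.0]])
betas = np.array([[1.0, 2.0, 0.0], [-1.0, -2.0, 0.5]])  # intercept and slopes

def simulate(n, p_first):
    z = (rng.random(n) >= p_first).astype(int)  # label 0 with probability p_first
    X = centers[z] + rng.normal(size=(n, 2))
    Xd = np.column_stack([np.ones(n), X])
    y = (Xd * betas[z]).sum(axis=1) + 0.5 * rng.normal(size=n)
    return X, y

X_t, y_t = simulate(40, 0.5)        # small target study
X_s1, y_s1 = simulate(300, 0.9)     # source dominated by subpopulation 0
X_s2, y_s2 = simulate(300, 0.1)     # source dominated by subpopulation 1
X_test, y_test = simulate(500, 0.5)

X_all = np.vstack([X_t, X_s1, X_s2])
y_all = np.concatenate([y_t, y_s1, y_s2])
gmm = fit_gmm(X_all, k=2)
resp = responsibilities(X_all, *gmm)
coefs = fit_subpop_regressions(X_all, y_all, resp)
mse_matched = np.mean((predict(X_test, gmm, coefs) - y_test) ** 2)

# Illustrative baseline: a single linear model fitted on the target study only.
Xd_t = np.column_stack([np.ones(len(X_t)), X_t])
beta_t = np.linalg.lstsq(Xd_t, y_t, rcond=None)[0]
Xd_test = np.column_stack([np.ones(len(X_test)), X_test])
mse_target_only = np.mean((Xd_test @ beta_t - y_test) ** 2)
print(mse_matched, mse_target_only)
```

In this sketch no source sample is discarded: every subject contributes to each subpopulation's fit in proportion to its estimated membership probability, which mirrors the motivation for matching at the subpopulation level rather than the study level.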

Category: Statistics (Methodology)