A Multivariate Bernoulli-Based Sampling Method for Multi-Label Data with Application to Meta-Research

Published: December 9, 2025 | arXiv ID: 2512.08371v1

By: Simon Chung, Colby J. Vorland, Donna L. Maney and more

Potential Business Impact:

Helps find rare information in big data.

Business Areas:
A/B Testing, Data and Analytics

Datasets may contain observations with multiple labels. When the labels are not mutually exclusive and vary greatly in frequency, it is challenging to obtain a sample that includes enough observations with scarcer labels to support inferences about those labels while deviating from the population frequencies in a known manner. In this paper, we take a multivariate Bernoulli distribution as the underlying distribution of a multi-label problem. We present a novel sampling algorithm that takes label dependencies into account. It uses observed label frequencies to estimate the multivariate Bernoulli distribution parameters and to calculate weights for each label combination. This approach ensures that the weighted sampling acquires the characteristics of the target distribution while accounting for label dependencies. We applied this approach to a sample of research articles from Web of Science labeled with 64 biomedical topic categories. We aimed to preserve the category frequency order, reduce the frequency differences between the most and least common categories, and account for category dependencies. This approach produced a more balanced sub-sample, enhancing the representation of minority categories.
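
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of weighting label combinations, not the authors' method. It assumes the joint distribution is estimated from the empirical frequency of each observed label combination, and it swaps in iterative proportional fitting (raking) to move that distribution toward chosen target marginals while keeping the observed odds ratios between labels. The names `combination_weights` and `shrink_toward_mean`, and the rule used to pick the targets, are hypothetical.

```python
import numpy as np

def shrink_toward_mean(marginals, shrink=0.5):
    """Hypothetical target rule: pull each label frequency toward the mean.

    With 0 <= shrink < 1 the frequency order is preserved and the gap
    between the most and least common labels is reduced.
    """
    marginals = np.asarray(marginals, dtype=float)
    return marginals + shrink * (marginals.mean() - marginals)

def combination_weights(Y, target_marginals, n_iter=200):
    """Weight each observed label combination (assumed approach: raking).

    Y is an (n, k) binary label matrix. Returns the distinct combinations,
    their empirical probabilities p, the adjusted probabilities q, and a
    per-row sampling probability vector that sums to 1.
    """
    Y = np.asarray(Y, dtype=int)
    target = np.asarray(target_marginals, dtype=float)
    combos, inverse, counts = np.unique(
        Y, axis=0, return_inverse=True, return_counts=True
    )
    inverse = np.reshape(inverse, -1)   # guard against shape differences across NumPy versions
    p = counts / counts.sum()           # empirical pmf over observed combinations
    q = p.copy()
    for _ in range(n_iter):
        for j in range(Y.shape[1]):
            has = combos[:, j] == 1
            m = q[has].sum()            # current marginal of label j under q
            if 0.0 < m < 1.0:
                q[has] *= target[j] / m
                q[~has] *= (1.0 - target[j]) / (1.0 - m)
    row_w = (q / p)[inverse]            # every row inherits its combination's weight
    return combos, p, q, row_w / row_w.sum()

def weighted_subsample(Y, target_marginals, size, seed=0):
    """Draw row indices without replacement using the combination weights."""
    _, _, _, row_probs = combination_weights(Y, target_marginals)
    rng = np.random.default_rng(seed)
    return rng.choice(len(row_probs), size=size, replace=False, p=row_probs)
```

Because each raking step only rescales groups of combinations multiplicatively, the odds ratios among labels in `q` equal those in the data, which is one simple way to respect label dependencies. The targets must be reachable from the observed combinations; otherwise the iteration will not match them.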

Page Count
10 pages

Category
Computer Science:
Machine Learning (CS)