
A Multivariate Bernoulli-Based Sampling Method for Multi-Label Data with Application to Meta-Research

Published: December 9, 2025 | arXiv ID: 2512.08371v1

By: Simon Chung, Colby J. Vorland, Donna L. Maney and more

Potential Business Impact:

Helps find rare information in big data.

Business Areas:

A/B Testing Data and Analytics

Datasets may contain observations with multiple labels. When the labels are not mutually exclusive and vary greatly in frequency, it is challenging to obtain a sample that includes enough observations with the scarcer labels to support inferences about them and that deviates from the population frequencies in a known manner. In this paper, we model the multi-label problem with an underlying multivariate Bernoulli distribution and present a novel sampling algorithm that takes label dependencies into account. The algorithm uses observed label frequencies to estimate the multivariate Bernoulli distribution parameters and to calculate a weight for each label combination. This ensures that the weighted sample acquires the characteristics of the target distribution while accounting for label dependencies. We applied the approach to a sample of research articles from Web of Science labeled with 64 biomedical topic categories. We aimed to preserve the frequency order of the categories, reduce the frequency differences between the most and least common categories, and account for category dependencies. The approach produced a more balanced sub-sample, enhancing the representation of minority categories.
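The abstract does not spell out the estimation or weighting steps, so the following is only a rough sketch of the general idea and not the authors' algorithm. It assumes the joint distribution over label combinations is estimated by plain empirical frequencies, and it swaps in a simple power-tempering rule (the hypothetical parameter `alpha`) as the target distribution. The function names `combination_weights` and `weighted_subsample` are illustrative, and sampling is done with replacement for simplicity. Tempering preserves the frequency order of label combinations; whether it preserves the order of individual category frequencies depends on the data.

```python
import random
from collections import Counter


def combination_weights(label_sets, alpha=0.5):
    """Return a weight for every observed label combination.

    Assumption: the joint probability p(c) of each combination c is its
    empirical frequency, and the target is q(c) proportional to p(c)**alpha.
    With 0 < alpha < 1 this flattens frequency differences between
    combinations while keeping their order. The weight is q(c) / p(c).
    """
    counts = Counter(frozenset(s) for s in label_sets)
    n = len(label_sets)
    p = {c: k / n for c, k in counts.items()}
    tempered = {c: pc ** alpha for c, pc in p.items()}
    z = sum(tempered.values())
    return {c: (tempered[c] / z) / p[c] for c in p}


def weighted_subsample(label_sets, size, alpha=0.5, seed=0):
    """Draw `size` indices (with replacement) using the combination weights."""
    weights = combination_weights(label_sets, alpha)
    per_item = [weights[frozenset(s)] for s in label_sets]
    rng = random.Random(seed)
    return rng.choices(range(len(label_sets)), weights=per_item, k=size)


# Toy data: label "b" only occurs together with "a", and rarely.
data = [{"a"}] * 8 + [{"a", "b"}] * 2
w = combination_weights(data, alpha=0.5)
picked = weighted_subsample(data, size=5)
```

Because the weights attach to whole label combinations rather than to single labels, co-occurrence between labels is carried into the sub-sample. In this toy data the rare combination `{"a", "b"}` receives a larger weight than `{"a"}`, and setting `alpha=1` leaves every weight at 1, which reproduces the population frequencies.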

Estimation of Bivariate Normal Distributions from Marginal Summaries in Clinical Trials

Methodology

Finds hidden patterns without seeing private data.

4 Aug 2025 0

85%

A Systematic Literature Review on Multi-label Data Stream Classification

Machine Learning (CS)

Helps computers sort many kinds of information fast.

24 Aug 2025 1

85%

Systematic Alias Sampling: an efficient and low-variance way to sample from a discrete distribution

Data Structures and Algorithms

Makes computers pick random numbers much faster.

28 Sep 2025 0


Page Count

10 pages
