A Multivariate Bernoulli-Based Sampling Method for Multi-Label Data with Application to Meta-Research
By: Simon Chung, Colby J. Vorland, Donna L. Maney and more
Potential Business Impact:
Helps find rare information in big data.
Datasets may contain observations with multiple labels. When the labels are not mutually exclusive and vary greatly in frequency, it is challenging to obtain a sample that includes enough observations with scarcer labels to support inferences about those labels and that deviates from the population frequencies in a known manner. In this paper, we model the multi-label problem with an underlying multivariate Bernoulli distribution. We present a novel sampling algorithm that takes label dependencies into account. It uses observed label frequencies to estimate the parameters of the multivariate Bernoulli distribution and to calculate a weight for each label combination. This approach ensures that the weighted sample acquires the characteristics of the target distribution while accounting for label dependencies. We applied this approach to a sample of research articles from Web of Science labeled with 64 biomedical topic categories. We aimed to preserve the order of category frequencies, reduce the frequency differences between the most and least common categories, and account for category dependencies. This approach produced a more balanced sub-sample, enhancing the representation of minority categories.
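To make the idea of per-combination weights concrete, here is a minimal Python sketch. It is not the authors' algorithm, whose details the abstract does not give. It assumes the joint distribution is estimated by the empirical frequency of each observed label combination, that the target label frequencies are a tempered version of the observed ones (the exponent `alpha` is a hypothetical tuning parameter), and that the combination probabilities are adjusted by iterative proportional fitting, a stand-in technique that matches the target marginals while keeping the observed dependence structure. The function name `combination_weights` and the simulated data are illustrative only.

```python
import numpy as np

def combination_weights(Y, alpha=0.5, n_iter=200):
    """Weight each observed label combination so that weighted sampling
    moves label frequencies toward a flatter target (assumed scheme)."""
    Y = np.asarray(Y, dtype=int)
    combos, inverse, counts = np.unique(
        Y, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.ravel()        # one combination index per observation
    p = counts / counts.sum()        # empirical joint pmf over combinations
    marg = p @ combos                # observed label frequencies

    # Assumed target: temper the frequencies, then rescale so the mean
    # number of labels per observation is unchanged. The map is monotone,
    # so the frequency order is kept while the max/min ratio shrinks.
    tempered = marg ** alpha
    target = tempered * marg.sum() / tempered.sum()

    # Iterative proportional fitting on the binary margins.
    q = p.copy()
    for _ in range(n_iter):
        for k in range(combos.shape[1]):
            on = combos[:, k] == 1
            m = q[on].sum()
            if m <= 0.0 or m >= 1.0:  # label never or always present: skip
                continue
            q[on] *= target[k] / m
            q[~on] *= (1.0 - target[k]) / (1.0 - m)

    return combos, q / p, inverse, marg, target

# Simulated data with three dependent labels (illustrative only).
rng = np.random.default_rng(0)
n = 5000
a = rng.random(n) < 0.6
b = rng.random(n) < np.where(a, 0.25, 0.10)
c = rng.random(n) < np.where(b, 0.15, 0.03)
Y = np.column_stack([a, b, c]).astype(int)

combos, w, inverse, marg, target = combination_weights(Y)
obs_prob = w[inverse] / w[inverse].sum()   # sampling probability per observation
weighted_marg = obs_prob @ Y               # expected label frequencies under sampling

# Draw a sub-sample; sampling with replacement matches the target in expectation.
idx = rng.choice(n, size=500, replace=True, p=obs_prob)
sub_sample = Y[idx]
```

Because every observation with the same label combination receives the same weight, the deviation from the population frequencies is known exactly: the weight of each combination is the ratio of its adjusted probability to its observed probability.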
Similar Papers
Estimation of Bivariate Normal Distributions from Marginal Summaries in Clinical Trials
Methodology
Finds hidden patterns without seeing private data.
A Systematic Literature Review on Multi-label Data Stream Classification
Machine Learning (CS)
Helps computers sort many kinds of information fast.
Systematic Alias Sampling: an efficient and low-variance way to sample from a discrete distribution
Data Structures and Algorithms
Makes computers pick random numbers much faster.