Nonparametric Inference on Unlabeled Histograms
By: Yun Ma, Pengkun Yang
Potential Business Impact:
Finds hidden patterns in data, even with missing pieces.
Statistical inference on histograms and frequency counts plays a central role in categorical data analysis. Moving beyond classical methods that directly analyze labeled frequencies, we introduce a framework that models the multiset of unlabeled histograms via a mixture distribution to better capture unseen domain elements in the large-alphabet regime. We study the nonparametric maximum likelihood estimator (NPMLE) under this framework and establish its optimal convergence rate in the Poisson setting. The NPMLE also immediately yields flexible and efficient plug-in estimators for functional estimation problems, and a localized variant further achieves the optimal sample complexity for a wide range of symmetric functionals. Extensive experiments on synthetic data, real-world datasets, and large language models highlight the practical benefits of the proposed method.
Similar Papers
Parametric convergence rate of some nonparametric estimators in mixtures of power series distributions
Statistics Theory
Estimates mixed count patterns accurately.
EM Approaches to Nonparametric Estimation for Mixture of Linear Regressions
Methodology
Finds hidden groups in data.
Besting Good–Turing: Optimality of Non-Parametric Maximum Likelihood for Distribution Estimation
Statistics Theory
Counts rare things better than old methods.