A robust contaminated discrete Weibull regression model for outlier-prone count data
By: Divan A. Burger, Janet van Niekerk, Emmanuel Lesaffre
Potential Business Impact:
Improves predictions for count data that contain rare, extreme values, such as unusually long hospital stays.
Count data often exhibit overdispersion driven by heavy tails or excess zeros, making standard models (e.g., Poisson, negative binomial) insufficient for handling outlying observations. We propose a novel contaminated discrete Weibull (cDW) framework that augments a baseline discrete Weibull (DW) distribution with a heavier-tail subcomponent. This mixture retains a single shifted-median parameter for a unified regression link while selectively assigning extreme outcomes to the heavier-tail subdistribution. The cDW distribution accommodates strictly positive data by setting the truncation limit c=1 as well as full-range counts with c=0. We develop a Bayesian regression formulation and describe posterior inference using Markov chain Monte Carlo sampling. In an application to hospital length-of-stay data (with c=1, meaning the minimum possible stay is 1), the cDW model more effectively captures extreme stays and preserves the median-based link. Simulation-based residual checks, leave-one-out cross-validation, and a Kullback-Leibler outlier assessment confirm that the cDW model provides a more robust fit than the single-component DW model, reducing the influence of outliers and improving predictive accuracy. A simulation study further demonstrates the cDW model's robustness in the presence of heavy contamination. We also discuss how a hurdle scheme can accommodate datasets with many zeros while preventing the spurious inflation of zeros in situations without genuine zero inflation.
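To make the mixture idea concrete, here is a minimal Python sketch of a contaminated discrete Weibull probability mass function. It is an illustration under stated assumptions, not the authors' parameterization: the abstract does not give formulas, so the names `m`, `beta`, `delta`, and `eta` are hypothetical. Each component is assumed to have survival function P(Y >= y) = 0.5^(((y - c) / m)^beta), so both components satisfy P(Y >= c + m) = 0.5 and share one median-like location parameter. The heavier-tail component is assumed to use the deflated shape `beta / eta` with `eta > 1`.

```python
def dw_survival(y, m, beta, c=0):
    """P(Y >= y) for a shifted discrete Weibull on y = c, c+1, ...

    Assumed form (not taken from the paper): 0.5 ** (((y - c) / m) ** beta),
    which gives P(Y >= c + m) = 0.5 for any shape beta > 0.
    """
    if y <= c:
        return 1.0
    return 0.5 ** (((y - c) / m) ** beta)


def dw_pmf(y, m, beta, c=0):
    """P(Y = y) as the difference of consecutive survival values."""
    if y < c:
        return 0.0
    return dw_survival(y, m, beta, c) - dw_survival(y + 1, m, beta, c)


def cdw_pmf(y, m, beta, delta, eta, c=0):
    """Two-component mixture sharing the location parameter m.

    delta: assumed contamination proportion, 0 < delta < 1.
    eta:   assumed tail-inflation factor, eta > 1; the contaminating
           component uses shape beta / eta and so has a heavier tail.
    """
    baseline = dw_pmf(y, m, beta, c)
    heavy = dw_pmf(y, m, beta / eta, c)
    return (1.0 - delta) * baseline + delta * heavy


if __name__ == "__main__":
    # Strictly positive counts (c=1), as in the length-of-stay application.
    for y in (1, 5, 30):
        print(y, dw_pmf(y, 4.0, 1.5, c=1), cdw_pmf(y, 4.0, 1.5, 0.1, 3.0, c=1))
```

In this sketch, setting `c=1` places no mass on zero, while `c=0` covers the full range of counts. In a regression setting one would typically link `m` to covariates, for example through a log link, although the abstract does not specify the link function.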