A robust contaminated discrete Weibull regression model for outlier-prone count data

Published: April 13, 2025 | arXiv ID: 2504.09536v1

By: Divan A. Burger, Janet van Niekerk, Emmanuel Lesaffre

Potential Business Impact:

Improves modeling and prediction for count data that contain extreme values, such as unusually long hospital stays.

Business Areas:
A/B Testing, Data and Analytics

Count data often exhibit overdispersion driven by heavy tails or excess zeros, making standard models (e.g., Poisson, negative binomial) insufficient for handling outlying observations. We propose a novel contaminated discrete Weibull (cDW) framework that augments a baseline discrete Weibull (DW) distribution with a heavier-tail subcomponent. This mixture retains a single shifted-median parameter for a unified regression link while selectively assigning extreme outcomes to the heavier-tail subdistribution. The cDW distribution accommodates strictly positive data by setting the truncation limit c=1 as well as full-range counts with c=0. We develop a Bayesian regression formulation and describe posterior inference using Markov chain Monte Carlo sampling. In an application to hospital length-of-stay data (with c=1, meaning the minimum possible stay is 1), the cDW model more effectively captures extreme stays and preserves the median-based link. Simulation-based residual checks, leave-one-out cross-validation, and a Kullback-Leibler outlier assessment confirm that the cDW model provides a more robust fit than the single-component DW model, reducing the influence of outliers and improving predictive accuracy. A simulation study further demonstrates the cDW model's robustness in the presence of heavy contamination. We also discuss how a hurdle scheme can accommodate datasets with many zeros while preventing the spurious inflation of zeros in situations without genuine zero inflation.
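The mixture idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact parameterization, so everything below is an assumption made for illustration: the function names (`dw_survival`, `dw_pmf`, `cdw_pmf`), the median-type parameter `mu`, the contamination weight `delta`, the tail-inflation factor `eta`, and the treatment of `c` as a simple shift of the support. The sketch uses the standard discrete Weibull survival function P(Y >= y) = q^(y^beta), rewritten so that it equals 0.5 at `c + mu`, and builds the heavier-tail component by dividing the shape `beta` by `eta > 1` while keeping `mu` shared.

```python
import math


def dw_survival(y, mu, beta, c=0):
    """P(Y >= y) for a discrete Weibull on {c, c+1, ...} (assumed form).

    The usual survival q**(y**beta) is rewritten with
    q = 0.5**(1 / mu**beta), so the survival equals 0.5 at y = c + mu.
    """
    if y <= c:
        return 1.0
    return 0.5 ** (((y - c) / mu) ** beta)


def dw_pmf(y, mu, beta, c=0):
    """P(Y = y), obtained as a difference of survival probabilities."""
    if y < c:
        return 0.0
    return dw_survival(y, mu, beta, c) - dw_survival(y + 1, mu, beta, c)


def cdw_pmf(y, mu, beta, delta, eta, c=0):
    """Hypothetical contaminated DW: two-component mixture sharing mu.

    delta is the mixing weight of the heavier-tail component and
    eta > 1 shrinks its shape parameter, which slows the tail decay.
    This is an illustrative construction, not the paper's definition.
    """
    base = dw_pmf(y, mu, beta, c)
    heavy = dw_pmf(y, mu, beta / eta, c)
    return (1.0 - delta) * base + delta * heavy


if __name__ == "__main__":
    # Strictly positive counts (c=1), e.g. length of stay in days.
    for y in (1, 5, 30):
        print(y, dw_pmf(y, 4.0, 1.5, c=1), cdw_pmf(y, 4.0, 1.5, 0.1, 3.0, c=1))
```

Under these assumptions, both components have survival probability 0.5 at the same point, while the second component places far more mass on large counts, which is the sense in which extreme outcomes can be absorbed without moving the shared location parameter.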

Page Count
20 pages

Category
Statistics: Methodology