Sleep pattern profiling using a finite mixture of contaminated multivariate skew-normal distributions on incomplete data
By: Jason Pillay, Cristina Tortora, Antonio Punzo and more
Medical data often exhibit characteristics that make cluster analysis particularly challenging, such as missing values, outliers, and cluster features like skewness. Typically, such data would need to be preprocessed -- by cleaning outliers and missing values -- before clustering could be performed. However, these preliminary steps rely on objective functions different from those used in the clustering stage. In this paper, we propose a unified model-based clustering approach that simultaneously handles atypical observations, missing values, and cluster-wise skewness within a single framework. Each cluster is modelled using a contaminated multivariate skew-normal distribution -- a convenient two-component mixture of multivariate skew-normal densities -- in which one component represents the main data (the "bulk") and the other captures potential outliers. From an inferential perspective, we implement and use a variant of the EM algorithm to obtain the maximum likelihood estimates of the model parameters. Simulation studies demonstrate that the proposed model outperforms existing approaches in both clustering accuracy and outlier detection, across low- and high-dimensional settings, even in the presence of substantial missingness. The method is further applied to the Cleveland Children's Sleep and Health Study (CCSHS), a dataset characterised by incomplete observations. Without any preprocessing, the proposed approach identifies five distinct groups of sleepers, revealing meaningful differences in sleeper typologies.
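To make the "bulk plus outlier component" idea concrete, the following is a minimal univariate sketch, not the authors' implementation. The abstract does not give the parametrisation, so the sketch assumes the form commonly used for contaminated distributions: both components share location and shape, and the outlier component has its scale inflated by a factor eta > 1. The function names and the parameters `alpha` (bulk proportion) and `eta` (inflation factor) are illustrative assumptions. Missing-data handling and the EM-type fitting are omitted.

```python
import math


def skew_normal_pdf(x, xi, omega, lam):
    """Univariate skew-normal density (Azzalini form).

    xi: location, omega: scale (> 0), lam: shape (lam = 0 gives the normal).
    """
    z = (x - xi) / omega
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal pdf
    Phi = 0.5 * (1.0 + math.erf(lam * z / math.sqrt(2.0)))    # standard normal cdf at lam*z
    return 2.0 * phi * Phi / omega


def contaminated_sn_pdf(x, xi, omega, lam, alpha, eta):
    """Two-component mixture: bulk with weight alpha, inflated-scale
    component with weight 1 - alpha (assumed parametrisation, eta > 1)."""
    bulk = skew_normal_pdf(x, xi, omega, lam)
    wide = skew_normal_pdf(x, xi, math.sqrt(eta) * omega, lam)
    return alpha * bulk + (1.0 - alpha) * wide


def prob_bulk(x, xi, omega, lam, alpha, eta):
    """Posterior probability that x comes from the bulk component.
    A value below 0.5 would flag x as a potential outlier."""
    bulk = alpha * skew_normal_pdf(x, xi, omega, lam)
    return bulk / contaminated_sn_pdf(x, xi, omega, lam, alpha, eta)


if __name__ == "__main__":
    params = dict(xi=0.0, omega=1.0, lam=2.0, alpha=0.9, eta=10.0)
    for x in (0.0, 2.0, 8.0):
        print(x, prob_bulk(x, **params))
```

Under these assumptions, a point near the location is attributed to the bulk, while a point far in the tail is attributed almost entirely to the inflated component. In a multivariate mixture of such distributions, the same posterior would be computed within each cluster.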