A tree-based kernel for densities and its applications in clustering DNase-seq profiles
By: Yuliang Xu, Kaixuan Luo, Li Ma
Potential Business Impact:
Finds DNA patterns to understand how genes turn on.
Modeling multiple sampling densities within a hierarchical framework enables borrowing of information across samples. These density random effects can act as kernels in latent variable models to represent exchangeable subgroups or clusters. A key feature of these kernels is the (functional) covariance they induce, which determines how densities are grouped in mixture models. Our motivating problem is clustering chromatin accessibility profiles from high-throughput DNase-seq experiments to detect transcription factor (TF) binding. TF binding typically produces footprint profiles with spatial patterns, creating long-range dependency across genomic locations. Existing nonparametric hierarchical models impose restrictive covariance assumptions and cannot accommodate such dependencies, often leading to biologically uninformative clusters. We propose a nonparametric density kernel flexible enough to capture diverse covariance structures and adaptive to various spatial patterns of TF footprints. The kernel specifies dyadic tree splitting probabilities via a multivariate logit-normal model with a sparse precision matrix. Bayesian inference for latent variable models using this kernel is implemented through Gibbs sampling with Polya-Gamma augmentation. Extensive simulations show that our kernel substantially improves clustering accuracy. We apply the proposed mixture model to DNase-seq data from the ENCODE project, which results in biologically meaningful clusters corresponding to binding events of two common TFs.
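To make the construction concrete, here is a minimal sketch of how one density might be drawn from a kernel of this kind. It is an illustration under stated assumptions, not the authors' implementation. The names `tridiagonal_precision` and `sample_tree_density` are hypothetical, and so are the tridiagonal sparsity pattern, the zero mean, and the breadth-first node ordering. The abstract says only that the precision matrix is sparse. The sketch draws one logit per internal node of a dyadic tree from a multivariate normal, maps the logits to left-split probabilities, and multiplies the probabilities along each root-to-leaf path to get bin masses.

```python
import numpy as np


def tridiagonal_precision(n_nodes, diag=2.0, off=-0.9):
    """Hypothetical sparse precision matrix (tridiagonal, diagonally dominant).

    Assumption: the actual sparsity pattern used in the paper is not stated
    in the abstract; this choice is for illustration only.
    """
    q = np.zeros((n_nodes, n_nodes))
    idx = np.arange(n_nodes)
    q[idx, idx] = diag
    q[idx[:-1], idx[1:]] = off
    q[idx[1:], idx[:-1]] = off
    return q


def sample_tree_density(depth, precision, rng):
    """Draw bin probabilities on 2**depth equal-width bins.

    Internal nodes are indexed breadth-first (an assumption). Each node's
    left-split probability is the logistic transform of a Gaussian logit.
    """
    n_nodes = 2**depth - 1
    # If Q = L L^T, then z = L^{-T} e with e ~ N(0, I) has covariance Q^{-1}.
    chol = np.linalg.cholesky(precision)
    z = np.linalg.solve(chol.T, rng.standard_normal(n_nodes))
    theta = 1.0 / (1.0 + np.exp(-z))  # probability of going to the left child

    probs = np.ones(1)  # mass of the root interval
    start = 0
    for level in range(depth):
        width = 2**level
        t = theta[start:start + width]
        # Split each interval's mass between its left and right halves,
        # keeping the children in left-to-right order.
        probs = np.column_stack((probs * t, probs * (1.0 - t))).ravel()
        start += width
    return probs


rng = np.random.default_rng(0)
depth = 3
Q = tridiagonal_precision(2**depth - 1)
p = sample_tree_density(depth, Q, rng)
print(p.round(3), p.sum())
```

Off-diagonal entries of the precision matrix couple the splitting logits of different nodes, which is how a model of this type can induce dependence between distant bins. A diagonal precision matrix would make the splits independent, as in a standard Pólya-tree-style prior.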