Scalable Tree Ensemble Proximities in Python
By: Adrien Aumon , Guy Wolf , Kevin R. Moon and more
Potential Business Impact:
Finds similar items fast without using much computer memory.
Tree ensemble methods such as Random Forests naturally induce supervised similarity measures through their decision tree structure, but existing implementations of proximities derived from tree ensembles typically suffer from quadratic time or memory complexity, limiting their scalability. In this work, we introduce a general framework for efficient proximity computation by defining a family of Separable Weighted Leaf-Collision Proximities. We show that any proximity measure in this family admits an exact sparse matrix factorization, restricting computation to leaf-level collisions and avoiding explicit pairwise comparisons. This formulation enables low-memory, scalable proximity computation using sparse linear algebra in Python. Empirical benchmarks demonstrate substantial runtime and memory improvements over traditional approaches, allowing tree ensemble proximities to scale efficiently to datasets with hundreds of thousands of samples on standard CPU hardware.
Similar Papers
The Generalized Proximity Forest
Machine Learning (CS)
Helps computers find weird patterns in data.
Wireless Dataset Similarity: Measuring Distances in Supervised and Unsupervised Machine Learning
Machine Learning (CS)
Helps wireless devices learn from different data.
Distance-based Learning of Hypertrees
Machine Learning (CS)
Learns complex connections faster than before.