Efficient and Stable Multi-Dimensional Kolmogorov-Smirnov Distance
By: Peter Matthew Jacobs, Foad Namjoo, Jeff M. Phillips
Potential Business Impact:
Tests whether two multi-dimensional datasets come from different distributions.
We revisit extending the Kolmogorov-Smirnov distance between probability distributions to the multidimensional setting and make new arguments about the proper way to approach this generalization. Our proposed formulation maximizes the difference over orthogonal dominating rectangular ranges (d-sided rectangles in R^d), and is an integral probability metric. We also prove that the distance between a distribution and a sample from the distribution converges to 0 as the sample size grows, and bound this rate. Moreover, we show that one can, up to this same approximation error, compute the distance efficiently in 4 or fewer dimensions; specifically, the runtime is near-linear in the size of the sample needed for that error. With this, we derive a delta-precision two-sample hypothesis test using this distance. Finally, we show these metric and approximation properties do not hold for other popular variants.
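To make the dominating-rectangle formulation concrete, here is a minimal Python sketch of a brute-force two-sample estimate: it takes the maximum difference of empirical measures over ranges {x : x <= q coordinate-wise}, with corners q restricted to the pooled sample points. That restriction, the function name, and the quadratic-time loop are simplifying assumptions for illustration only; the paper's near-linear algorithm and its approximation guarantees are not reproduced here.

import numpy as np

def dominating_ks_distance(X, Y):
    """Brute-force estimate of a dominating-rectangle KS-style distance.

    X: (n, d) array of samples from the first distribution.
    Y: (m, d) array of samples from the second distribution.
    Corners q are restricted to pooled sample points (an approximation:
    the true maximizer may lie on a mixed coordinate grid).
    Runtime is O((n + m)^2 * d), far from the paper's near-linear method.
    """
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    corners = np.vstack([X, Y])  # candidate corner points q
    best = 0.0
    for q in corners:
        # empirical mass of each sample inside the dominating range {x : x <= q}
        fx = np.mean(np.all(X <= q, axis=1))
        fy = np.mean(np.all(Y <= q, axis=1))
        best = max(best, abs(fx - fy))
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(0.0, 1.0, size=(500, 2))
    B = rng.normal(0.3, 1.0, size=(500, 2))  # shifted mean, so distance should be noticeably > 0
    print(dominating_ks_distance(A, B))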
Similar Papers
Sample Complexity of Nonparametric Closeness Testing for Continuous Distributions and Its Application to Causal Discovery with Hidden Confounding
Machine Learning (CS)
Finds cause and effect in complex data.
The Kolmogorov-Smirnov Statistic Revisited
Statistics Theory
Tests if data groups are the same.
On the Kolmogorov Distance of Max-Stable Distributions
Probability
Finds patterns in extreme weather events.