The Information Theory of Similarity
By: Nikit Phadke
Potential Business Impact:
Lets computers measure how alike two things are, and proves the fewest bits needed to do it.
We establish a precise mathematical equivalence between witness-based similarity systems (REWA) and Shannon's information theory. We prove that witness overlap is mutual information, that REWA bit complexity bounds arise from channel capacity limitations, and that ranking-preserving encodings obey rate-distortion constraints. This unification reveals that fifty years of similarity search research -- from Bloom filters to locality-sensitive hashing to neural retrieval -- implicitly developed information theory for relational data. We derive fundamental lower bounds showing that REWA's $O(\Delta^{-2} \log N)$ complexity is optimal: no encoding scheme can preserve similarity rankings with fewer bits. The framework establishes that semantic similarity has physical units (bits of mutual information), search is communication (query transmission over a noisy channel), and retrieval systems face fundamental capacity limits analogous to Shannon's channel coding theorem.
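To make the $O(\Delta^{-2} \log N)$ scaling concrete, here is a minimal sketch using a SimHash-style random-hyperplane encoder as a stand-in for REWA witnesses; the paper's actual construction may differ, and the names `witness_bits`, `overlap`, and all parameter values are illustrative assumptions. The idea it demonstrates: overlap between binary witness codes estimates similarity, and by a Hoeffding-plus-union-bound argument, on the order of $\Delta^{-2} \log N$ bits keep every pairwise overlap estimate within $\Delta/2$ of its mean, which is enough to preserve rankings separated by a similarity gap of $\Delta$.

```python
import numpy as np

# Sketch of a witness-based similarity encoding (SimHash-style).
# This is a hypothetical stand-in for a REWA encoder, used only to
# illustrate (1) overlap as a similarity estimator and
# (2) the Delta^{-2} log N bit-complexity scaling.

rng = np.random.default_rng(0)

def witness_bits(x, planes):
    """Encode a vector as binary witnesses: signs of random projections."""
    return planes @ x > 0

def overlap(a, b):
    """Fraction of agreeing witness bits; estimates similarity."""
    return np.mean(a == b)

d = 64        # ambient dimension
delta = 0.05  # smallest similarity gap we must resolve
N = 1000      # number of items in the corpus

# Hoeffding + union bound over ~N^2 pairs: m = O(delta^-2 log N) bits
# suffice so every pairwise overlap estimate concentrates within delta/2.
m = int(np.ceil(2 / delta**2 * np.log(N)))
planes = rng.standard_normal((m, d))

x = rng.standard_normal(d)
y = x + 0.1 * rng.standard_normal(d)  # a near neighbor of x
z = rng.standard_normal(d)            # an unrelated item

bx, by, bz = (witness_bits(v, planes) for v in (x, y, z))
print(f"m = {m} witness bits")
print(f"overlap(x, y) = {overlap(bx, by):.3f}  (similar pair)")
print(f"overlap(x, z) = {overlap(bx, bz):.3f}  (dissimilar pair)")
```

Running this prints a high overlap for the similar pair and roughly 0.5 for the unrelated one; shrinking `delta` inflates `m` quadratically, which is the intuition behind the lower bound the abstract claims is tight.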
Similar Papers
REWA: Witness-Overlap Theory -- Foundations for Composable Binary Similarity Systems
Machine Learning (CS)
Finds similar things using fewer computer bits.
REWA: A General Theory of Witness-Based Similarity
Machine Learning (CS)
Unites many ways to find similar things.
Unifying Information-Theoretic and Pair-Counting Clustering Similarity
Machine Learning (Stat)
Unifies ways to measure how well two clusterings agree.