Statistical Foundations of DIME: Risk Estimation for Practical Index Selection
By: Giulio D'Erasmo, Cesare Campagnano, Antonio Mallia and more
Potential Business Impact:
Reduces the memory that search embeddings need, enabling faster searching.
High-dimensional dense embeddings have become central to modern Information Retrieval, but many dimensions are noisy or redundant. The recently proposed DIME (Dimension IMportance Estimation) provides query-dependent scores to identify the informative components of embeddings. However, DIME relies on a costly grid search to select, a priori, a single dimensionality for all embeddings in the query corpus. Our work provides a statistically grounded criterion that directly identifies the optimal set of dimensions for each query at inference time. Experiments confirm that this criterion achieves parity of effectiveness while reducing embedding size by an average of $\sim50\%$ across different models and datasets at inference time.
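To make the idea of per-query dimension selection concrete, here is a minimal sketch in plain Python. It is not the paper's method: the abstract does not spell out the statistical criterion, so the sketch substitutes a simple placeholder rule (keep dimensions whose score exceeds a threshold). The importance score used here, the elementwise product of the query with a pseudo-relevant "feedback" embedding, is likewise an assumption chosen for illustration, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def importance_scores(query: Sequence[float], feedback: Sequence[float]) -> List[float]:
    """ASSUMPTION: score each dimension by query[i] * feedback[i].

    `feedback` stands in for some pseudo-relevant embedding; the abstract
    does not say how scores are actually computed.
    """
    return [q * f for q, f in zip(query, feedback)]


def select_dimensions(scores: Sequence[float], threshold: float = 0.0) -> List[int]:
    """PLACEHOLDER rule: keep dimensions whose score exceeds `threshold`.

    The paper's statistically grounded criterion would replace this step.
    """
    return [i for i, s in enumerate(scores) if s > threshold]


def rank(
    query: Sequence[float],
    docs: Sequence[Sequence[float]],
    keep: Sequence[int],
) -> Tuple[List[int], List[float]]:
    """Rank documents by dot product restricted to the kept dimensions."""
    sims = [sum(query[i] * d[i] for i in keep) for d in docs]
    order = sorted(range(len(docs)), key=lambda j: sims[j], reverse=True)
    return order, sims


if __name__ == "__main__":
    # Toy 4-dimensional embeddings, invented for this example.
    query = [0.5, -0.2, 0.1, 0.8]
    feedback = [0.4, 0.3, -0.1, 0.9]
    docs = [[0.4, 0.3, -0.1, 0.9], [-0.5, 0.9, 0.9, 0.1]]

    keep = select_dimensions(importance_scores(query, feedback))
    order, sims = rank(query, docs, keep)
    print("kept dimensions:", keep)
    print("ranking:", order)
```

In this toy case two of the four dimensions are kept for the query, so only half of each embedding takes part in the similarity computation. The set of kept dimensions is chosen per query rather than fixed in advance for the whole collection, which is the contrast the abstract draws with grid search.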
Similar Papers
Information-Theoretic Quality Metric of Low-Dimensional Embeddings
Machine Learning (CS)
Finds hidden information loss in data maps.
Accurate Estimation of Mutual Information in High Dimensional Data
Data Analysis, Statistics and Probability
Makes computers understand data better, even messy data.
Likelihood-Preserving Embeddings for Statistical Inference
Machine Learning (Stat)
Keeps math results the same after data shrinking.