Oh That Looks Familiar: A Novel Similarity Measure for Spreadsheet Template Discovery
By: Ananad Krishnakumar, Vengadesh Ravikumaran
Potential Business Impact:
Finds similar spreadsheets automatically.
Traditional methods for identifying structurally similar spreadsheets fail to capture the spatial layouts and type patterns that define templates. To quantify spreadsheet similarity, we introduce a hybrid distance metric that combines semantic embeddings, data type information, and spatial positioning. Our method converts each spreadsheet into a set of cell-level embeddings and compares two spreadsheets by aggregating over those sets with measures such as the Chamfer and Hausdorff distances. Experiments across template families demonstrate better unsupervised clustering performance than the graph-based Mondrian baseline, with perfect template reconstruction on the FUSTE dataset (Adjusted Rand Index of 1.00 versus 0.90). Our approach facilitates large-scale automated template discovery, which in turn enables downstream applications such as retrieval-augmented generation over tabular collections, model training, and bulk data cleaning.
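To make the aggregation step concrete, here is a minimal sketch of the Chamfer and Hausdorff distances between two sets of cell vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, each cell is assumed to be already encoded as a fixed-length vector (for example a semantic embedding concatenated with type and position features), Euclidean distance is assumed between cells, and the Chamfer variant shown is the sum of the two directed mean nearest-neighbor distances.

```python
import numpy as np


def pairwise_distances(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Sum of the mean nearest-neighbor distances in both directions."""
    d = pairwise_distances(a, b)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())


def hausdorff_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Largest nearest-neighbor distance in either direction."""
    d = pairwise_distances(a, b)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))


# Toy "spreadsheets": each row is one cell's feature vector (assumed 2-D here).
sheet_a = np.array([[0.0, 0.0], [1.0, 0.0]])
sheet_b = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 3.0]])

print(chamfer_distance(sheet_a, sheet_b))
print(hausdorff_distance(sheet_a, sheet_b))
```

In this toy case the one extra cell in `sheet_b` dominates the Hausdorff distance, while the Chamfer distance averages its effect over all cells. That is one reason the two aggregations can rank the same pair of spreadsheets differently.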
Similar Papers
Statistical Inference for Manifold Similarity and Alignability across Noisy High-Dimensional Datasets
Statistics Theory
Compares complex data by looking at its hidden shapes.
Efficient Identification of High Similarity Clusters in Polygon Datasets
Machine Learning (CS)
Finds similar places faster in huge maps.
Metric Matters: A Formal Evaluation of Similarity Measures in Active Learning for Cyber Threat Intelligence
Machine Learning (CS)
Finds hidden computer spies faster by learning what's normal.