Fast and explainable clustering in the Manhattan and Tanimoto distance

Published: January 13, 2026 | arXiv ID: 2601.08781v1

By: Stefan Güttel, Kaustubh Roy

Potential Business Impact:

Clusters data such as chemical fingerprints roughly 30 to 80 times faster than established methods on the reported benchmark.

Business Areas:
Big Data, Data and Analytics

The CLASSIX algorithm is a fast and explainable approach to data clustering. In its original form, this algorithm exploits the sorting of the data points by their first principal component to truncate the search for nearby data points, with nearness being defined in terms of the Euclidean distance. Here we extend CLASSIX to other distance metrics, including the Manhattan distance and the Tanimoto distance. Instead of principal components, we use an appropriate norm of the data vectors as the sorting criterion, combined with the triangle inequality for search termination. In the case of Tanimoto distance, a provably sharper intersection inequality is used to further boost the performance of the new algorithm. On a real-world chemical fingerprint benchmark, CLASSIX Tanimoto is about 30 times faster than the Taylor-Butina algorithm, and about 80 times faster than DBSCAN, while computing higher-quality clusters in both cases.
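To make the sorting-plus-termination idea concrete, here is a minimal sketch of a norm-sorted neighbor search. It is not the CLASSIX implementation, and it covers only the pruned search, not the aggregation or merging stages. All function names and the `tol` parameter are hypothetical. For the Manhattan distance it relies on the reverse triangle inequality, | ‖x‖₁ − ‖y‖₁ | ≤ ‖x − y‖₁. For the Tanimoto distance on binary fingerprints (represented here as sets of on-bit indices, an assumption) it uses the standard size-ratio bound T(a, b) ≤ min(|a|, |b|) / max(|a|, |b|), which is not necessarily the sharper intersection inequality mentioned in the abstract.

```python
from itertools import combinations


def manhattan(x, y):
    """Manhattan (L1) distance between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def tanimoto_distance(a, b):
    """1 - |a & b| / |a | b| for sets of on-bit indices (0.0 if both are empty)."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def manhattan_pairs(points, radius):
    """All index pairs (i, j), i < j, whose Manhattan distance is <= radius."""
    norms = [sum(abs(v) for v in p) for p in points]
    order = sorted(range(len(points)), key=norms.__getitem__)
    pairs = set()
    for pos, i in enumerate(order):
        for q in range(pos + 1, len(order)):
            j = order[q]
            # Reverse triangle inequality: a norm gap above radius rules out
            # j and, because of the sorting, every later candidate as well.
            if norms[j] - norms[i] > radius:
                break
            if manhattan(points[i], points[j]) <= radius:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def tanimoto_pairs(fingerprints, radius, tol=1e-9):
    """All index pairs (i, j), i < j, whose Tanimoto distance is <= radius."""
    sizes = [len(f) for f in fingerprints]
    order = sorted(range(len(fingerprints)), key=sizes.__getitem__)
    pairs = set()
    for pos, i in enumerate(order):
        for q in range(pos + 1, len(order)):
            j = order[q]
            # Similarity is at most sizes[i] / sizes[j] (sizes[i] <= sizes[j]),
            # so the scan stops once that ratio drops below 1 - radius.
            # tol guards against floating-point rounding at the boundary.
            if sizes[i] + tol < (1.0 - radius) * sizes[j]:
                break
            if tanimoto_distance(fingerprints[i], fingerprints[j]) <= radius:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def brute_force_pairs(items, dist, radius):
    """Reference result without pruning, useful for checking the sketch."""
    return {(i, j) for i, j in combinations(range(len(items)), 2)
            if dist(items[i], items[j]) <= radius}
```

In both functions the `break` is what truncates the search: once the sorted key of a candidate is too far from that of the query point, no later candidate can lie within the radius. Comparing the output against `brute_force_pairs` on a small dataset is a quick way to confirm that the pruning discards no valid pair.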

Country of Origin
🇬🇧 United Kingdom

Page Count
13 pages

Category
Computer Science:
Machine Learning (CS)