SPER: Accelerating Progressive Entity Resolution via Stochastic Bipartite Maximization
By: Dimitrios Karapiperis, George Papadakis, Themis Palpanas and more
Potential Business Impact:
Finds matching records faster in very large datasets.
Entity Resolution (ER) is a critical data cleaning task for identifying records that refer to the same real-world entity. In the era of Big Data, traditional batch ER is often infeasible due to volume and velocity constraints, necessitating Progressive ER methods that maximize recall within a limited computational budget. However, existing progressive approaches fail to scale to high-velocity streams because they rely on deterministic sorting to prioritize candidate pairs, a process that incurs prohibitive super-linear complexity and heavy initialization costs. To address this scalability wall, we introduce SPER (Stochastic Progressive ER), a novel framework that redefines prioritization as a sampling problem rather than a ranking problem. By replacing global sorting with a continuous stochastic bipartite maximization strategy, SPER acts as a probabilistic high-pass filter that selects high-utility pairs in strictly linear time. Extensive experiments on eight real-world datasets demonstrate that SPER achieves significant speedups (3x to 6x) over state-of-the-art baselines while maintaining comparable recall and precision.
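To make the contrast between ranking and sampling concrete, here is a minimal sketch of a single-pass probabilistic high-pass filter over candidate pairs. It is not SPER's actual algorithm, which the abstract does not specify. The function name `stochastic_filter`, the parameters `budget` and `sharpness`, and the acceptance rule `sim ** sharpness` are all assumptions chosen for illustration.

```python
import random
from typing import Iterable, Iterator, Optional, Tuple

# A candidate pair: (left record id, right record id, similarity in [0, 1]).
Pair = Tuple[str, str, float]


def stochastic_filter(
    pairs: Iterable[Pair],
    budget: int,
    sharpness: float = 4.0,
    rng: Optional[random.Random] = None,
) -> Iterator[Pair]:
    """Emit up to `budget` pairs in one pass, favouring high similarity.

    Hypothetical acceptance rule (an assumption, not taken from the paper):
    a pair is kept with probability sim ** sharpness. Larger `sharpness`
    suppresses low-similarity pairs more aggressively.
    """
    rng = rng or random.Random()
    emitted = 0
    for left, right, sim in pairs:
        if emitted >= budget:
            break  # comparison budget exhausted
        # One random draw per pair: O(1) work, so O(n) overall, no sorting.
        if rng.random() < sim ** sharpness:
            emitted += 1
            yield left, right, sim


if __name__ == "__main__":
    candidates = [("a1", "b7", 0.95), ("a2", "b3", 0.20), ("a4", "b4", 0.88)]
    for pair in stochastic_filter(candidates, budget=2, rng=random.Random(0)):
        print(pair)
```

A sort-based prioritizer would need to see and order all candidates before emitting the first one, at O(n log n) cost. The sketch above instead makes an independent accept-or-reject decision per pair, so it can run over a stream and stop as soon as the budget is spent. The trade-off is that it may skip some high-similarity pairs and admit a few low-similarity ones.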
Similar Papers
Progressive Entity Resolution: A Design Space Exploration
Databases
Finds and groups similar information faster.
FastER: Fast On-Demand Entity Resolution in Property Graphs
Databases
Finds matching people in data much faster.
In-context Clustering-based Entity Resolution with Large Language Models: A Design Space Exploration
Databases
Groups similar online information faster and cheaper.