Evaluating Joinable Column Discovery Approaches for Context-Aware Search
By: Harsha Kokel, Aamod Khatiwada, Tejaswini Pedapati and more
Potential Business Impact:
Finds matching data across different computer files.
Joinable Column Discovery is a critical challenge in automating enterprise data analysis. While existing approaches focus on syntactic overlap and semantic similarity, there remains limited understanding of which methods perform best for different data characteristics and how multiple criteria influence discovery effectiveness. We present a comprehensive experimental evaluation of joinable column discovery methods across diverse scenarios. Our study compares syntactic and semantic techniques on seven benchmarks covering relational databases and data lakes. We analyze six key criteria -- unique values, intersection size, join size, reverse join size, value semantics, and metadata semantics -- and examine how combining them through ensemble ranking affects performance. Our analysis reveals differences in method behavior across data contexts and highlights the benefits of integrating multiple criteria for robust join discovery. We provide empirical evidence on when each criterion matters, compare pre-trained embedding models for semantic joins, and offer practical guidelines for selecting suitable methods based on dataset characteristics. Our findings show that metadata and value semantics are crucial for data lakes, size-based criteria play a stronger role in relational databases, and ensemble approaches consistently outperform single-criterion methods.
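To make the size-based criteria and the idea of ensemble ranking concrete, the following is a minimal Python sketch. It is not the authors' implementation. The abstract names the criteria but does not define them, so the definitions here are assumptions: unique values is the number of distinct values in the candidate column, intersection size is the number of distinct values shared with the query column, join size is the number of query rows with a match in the candidate, and reverse join size is the number of candidate rows with a match in the query. The ensemble step uses a simple sum of per-criterion ranks, which is one common aggregation scheme chosen for illustration; the function names `criteria` and `ensemble_rank` are hypothetical. The two semantic criteria (value and metadata semantics) need an embedding model and are left out.

```python
from collections import Counter

# Criterion names from the abstract; the definitions below are assumed.
CRITERIA = ("unique_values", "intersection_size", "join_size", "reverse_join_size")


def criteria(query, candidate):
    """Compute assumed size-based criteria for one query/candidate column pair."""
    q_counts, c_counts = Counter(query), Counter(candidate)
    shared = q_counts.keys() & c_counts.keys()  # distinct values in both columns
    return {
        "unique_values": len(c_counts),
        "intersection_size": len(shared),
        # Query rows whose value also appears in the candidate (assumed definition).
        "join_size": sum(q_counts[v] for v in shared),
        # Candidate rows whose value also appears in the query (assumed definition).
        "reverse_join_size": sum(c_counts[v] for v in shared),
    }


def ensemble_rank(query, candidates):
    """Rank candidate columns by the sum of their per-criterion ranks.

    `candidates` maps a column name to its list of values. Higher criterion
    values are treated as better (an assumption); ties keep insertion order
    because Python's sort is stable.
    """
    names = list(candidates)
    scores = {name: criteria(query, candidates[name]) for name in names}
    total_rank = dict.fromkeys(names, 0)
    for crit in CRITERIA:
        ordered = sorted(names, key=lambda n: -scores[n][crit])
        for rank, name in enumerate(ordered):
            total_rank[name] += rank
    return sorted(names, key=lambda n: total_rank[n])


if __name__ == "__main__":
    query = ["a", "b", "b", "c"]
    candidates = {"x": ["a", "b", "c", "d"], "y": ["z", "z", "a"]}
    print(criteria(query, candidates["x"]))
    print(ensemble_rank(query, candidates))
```

Summing ranks rather than raw scores keeps criteria with very different scales (for example, join size versus intersection size) from dominating one another, which is one plausible reason to combine criteria at the ranking level.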