Towards Operationalizing Heterogeneous Data Discovery
By: Jin Wang, Yanlin Feng, Chen Shen and more
Potential Business Impact:
Lets computers understand all kinds of messy data.
Querying and exploring massive collections of data sources, such as data lakes, has been an essential research topic in the database community. Although much effort has been devoted to data discovery and data integration in data lakes, prior work has mainly focused on the scenario where the data lake consists of structured tables. However, real-world enterprise data lakes are typically more complicated and may contain silos of multi-modal data sources with structured, semi-structured and unstructured data. In this paper, we envision an end-to-end system with a declarative interface for querying and analyzing multi-modal data lakes. First, we propose a set of multi-modal operators, a unified interface that extends the relational operations with AI-composed ones to express analytical workloads over data sources in various modalities. In addition, we formally define the essential steps in the system: data discovery, query planning, query processing and results aggregation. On this basis, we pinpoint the research challenges and discuss potential opportunities for realizing and optimizing these steps with advanced techniques brought by Large Language Models. Finally, we demonstrate our preliminary attempts to address this problem and outline future plans for this research topic.
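The four steps named in the abstract (data discovery, query planning, query processing and results aggregation) can be illustrated with a toy sketch. This is not the authors' system: every name here (Source, discover, plan, process, aggregate, relational_filter, semantic_filter) is hypothetical, the sample data is invented for illustration, and the "AI-composed" operator is replaced by a plain substring match so the example runs without any model.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """A hypothetical data-lake source; not a type from the paper."""
    name: str
    modality: str      # "structured", "semi-structured" or "unstructured"
    description: str
    records: list


def discover(sources, keywords):
    """Data discovery: keep sources whose description mentions any keyword."""
    return [s for s in sources
            if any(k.lower() in s.description.lower() for k in keywords)]


def relational_filter(predicate):
    """Relational-style operator over rows represented as dicts."""
    return lambda rows: [r for r in rows if predicate(r)]


def semantic_filter(term):
    """Stand-in for an AI-composed operator: substring match, no model call."""
    return lambda docs: [d for d in docs if term.lower() in d.lower()]


def plan(sources, structured_op, text_op):
    """Query planning: pair each source with the operator for its modality."""
    return [(s, structured_op if s.modality == "structured" else text_op)
            for s in sources]


def process(steps):
    """Query processing: run each planned operator on its source."""
    return {s.name: op(s.records) for s, op in steps}


def aggregate(partials):
    """Results aggregation: count matches per source and in total."""
    counts = {name: len(rows) for name, rows in partials.items()}
    counts["total"] = sum(counts.values())
    return counts


def run_demo():
    # Invented sample data, for illustration only.
    sources = [
        Source("orders", "structured", "customer orders table",
               [{"customer": "a", "amount": 120},
                {"customer": "b", "amount": 30}]),
        Source("tickets", "unstructured", "customer support tickets",
               ["Refund requested for late delivery", "Password reset"]),
        Source("sensors", "semi-structured", "factory sensor logs",
               ["temp=70"]),
    ]
    found = discover(sources, ["customer"])
    steps = plan(found,
                 relational_filter(lambda r: r["amount"] > 100),
                 semantic_filter("refund"))
    return aggregate(process(steps))


if __name__ == "__main__":
    print(run_demo())
```

In a real system of the kind the abstract envisions, the substring match would presumably be replaced by a model-backed operator and the planner would choose among operators by cost, but both of those are assumptions beyond what the abstract states.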
Similar Papers
TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes
Databases
Helps computers understand all kinds of data better.
Declarative Techniques for NL Queries over Heterogeneous Data
Databases
Lets computers answer questions from many data sources.
Harnessing Rich Multi-Modal Data for Spatial-Temporal Homophily-Embedded Graph Learning Across Domains and Localities
Machine Learning (CS)
Connects city data to solve problems.