TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes
By: Chao Zhang, Shaolei Zhang, Quehuan Liu, and more
Potential Business Impact:
Helps computers understand all kinds of data better.
The variety of data in data lakes presents significant challenges for data analytics, as data scientists must simultaneously analyze multi-modal data, including structured, semi-structured, and unstructured data. While Large Language Models (LLMs) have demonstrated promising capabilities, they remain inadequate for multi-modal data analytics in terms of accuracy, efficiency, and freshness. First, current natural language (NL) or SQL-like query languages may struggle to precisely and comprehensively capture users' analytical intent. Second, relying on a single unified LLM to process diverse data modalities often leads to substantial inference overhead. Third, data stored in data lakes may be incomplete or outdated, making it essential to integrate external open-domain knowledge to generate timely and relevant analytics results. In this paper, we envision a new multi-modal data analytics system. Specifically, we propose a novel architecture built upon the Model Context Protocol (MCP), an emerging paradigm that enables LLMs to collaborate with knowledgeable agents. First, we define a semantic operator hierarchy tailored for querying multi-modal data in data lakes and develop an AI-agent-powered NL2Operator translator to bridge user intent and analytical execution. Next, we introduce an MCP-based execution framework, in which each MCP server hosts specialized foundation models optimized for specific data modalities. This design enhances both accuracy and efficiency, while supporting high scalability through modular deployment. Finally, we propose an updating mechanism that harnesses deep research and machine unlearning techniques to refresh both the data lakes and LLM knowledge, with the goal of balancing data freshness and inference efficiency.
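The execution idea in the abstract, where each step of a query goes to a server specialized for one data modality, can be pictured with a minimal sketch. Everything named here is hypothetical and chosen for illustration: the `Operator` and `Router` classes, the operator names, and the stub "servers" are assumptions, not the paper's operator hierarchy and not the MCP SDK. Plain Python functions stand in for MCP servers so the routing logic can be read on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical modality labels; the abstract names structured,
# semi-structured, and unstructured data.
STRUCTURED = "structured"
SEMI_STRUCTURED = "semi-structured"
UNSTRUCTURED = "unstructured"


@dataclass(frozen=True)
class Operator:
    """One step of a query plan (a stand-in for a semantic operator)."""
    name: str      # illustrative operator name, e.g. "filter"
    modality: str  # which kind of data the step touches
    argument: str  # free-form payload for the step


# A handler stands in for a server hosting a modality-specific model.
Handler = Callable[[Operator], str]


def make_stub_server(label: str) -> Handler:
    """Build a fake server that only echoes the request it received."""
    def handle(op: Operator) -> str:
        return f"[{label}] {op.name}({op.argument})"
    return handle


class Router:
    """Dispatch each operator to the server registered for its modality."""

    def __init__(self) -> None:
        self._servers: Dict[str, Handler] = {}

    def register(self, modality: str, handler: Handler) -> None:
        self._servers[modality] = handler

    def execute(self, plan: List[Operator]) -> List[str]:
        results = []
        for op in plan:
            if op.modality not in self._servers:
                raise KeyError(f"no server registered for {op.modality!r}")
            results.append(self._servers[op.modality](op))
        return results


router = Router()
router.register(STRUCTURED, make_stub_server("sql-model"))
router.register(UNSTRUCTURED, make_stub_server("text-model"))

# A two-step plan, as a translator from natural language might produce.
plan = [
    Operator("filter", STRUCTURED, "year >= 2024"),
    Operator("summarize", UNSTRUCTURED, "customer reviews"),
]
outputs = router.execute(plan)
for line in outputs:
    print(line)
```

Because servers are registered per modality, supporting a new modality in this sketch means registering one more handler rather than changing the router, which is one way to read the abstract's claim about modular deployment.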
Similar Papers
Towards Operationalizing Heterogeneous Data Discovery
Databases
Lets computers understand all kinds of messy data.
Secure Multi-Modal Data Fusion in Federated Digital Health Systems via MCP
Cryptography and Security
Helps doctors use patient data securely for better health.
LLM/Agent-as-Data-Analyst: A Survey
Artificial Intelligence
Computers understand and analyze any kind of data.