Automatic Metadata Extraction for Text-to-SQL
By: Vladislav Shkapenyuk, Divesh Srivastava, Theodore Johnson and more
Potential Business Impact:
Helps computers understand data without experts.
Large Language Models (LLMs) have recently become sophisticated enough to automate many tasks ranging from pattern finding to writing assistance to code generation. In this paper, we examine text-to-SQL generation. We have observed from decades of experience that the most difficult part of query development lies in understanding the database contents. These experiences inform the direction of our research. Text-to-SQL benchmarks such as SPIDER and BIRD contain extensive metadata that is generally not available in practice. Human-generated metadata requires the use of expensive Subject Matter Experts (SMEs), who are often not fully aware of many aspects of their databases. In this paper, we explore techniques for automatic metadata extraction to enable text-to-SQL generation. We explore two standard metadata extraction techniques and one newer one: profiling, query log analysis, and SQL-to-text generation using an LLM. We use the BIRD benchmark [JHQY+23] to evaluate the effectiveness of these techniques. BIRD does not provide query logs for its test database, so we prepared a submission that uses profiling alone and does not use any specially tuned model (we used GPT-4o). From Sept 1 to Sept 23, 2024, and from Nov 11 through Nov 23, 2024, we achieved the highest score both with and without using the "oracle" information provided with the question set. We regained the number 1 spot on Mar 11, 2025, and are still at #1 at the time of writing (May 2025).
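To make the idea of profiling concrete, here is a minimal sketch of what a column profiler might collect. It is not the authors' implementation: the helper names `profile_table` and `describe_profile`, the choice of statistics, the SQLite backend, and the `orders` example table are all assumptions made for illustration.

```python
import sqlite3


def profile_table(conn, table, top_k=3):
    """Collect simple per-column statistics for one table (hypothetical helper)."""
    cur = conn.cursor()
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
    columns = [(row[1], row[2]) for row in cur.execute(f'PRAGMA table_info("{table}")')]
    total = cur.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    profile = {}
    for name, decl_type in columns:
        # COUNT(col) skips NULLs, so total - non_null is the NULL count.
        non_null, distinct, lo, hi = cur.execute(
            f'SELECT COUNT("{name}"), COUNT(DISTINCT "{name}"), '
            f'MIN("{name}"), MAX("{name}") FROM "{table}"'
        ).fetchone()
        # Most frequent non-NULL values; ties are broken by value for determinism.
        top = cur.execute(
            f'SELECT "{name}", COUNT(*) AS n FROM "{table}" '
            f'WHERE "{name}" IS NOT NULL '
            f'GROUP BY "{name}" ORDER BY n DESC, "{name}" LIMIT ?',
            (top_k,),
        ).fetchall()
        profile[name] = {
            "type": decl_type,
            "rows": total,
            "nulls": total - non_null,
            "distinct": distinct,
            "min": lo,
            "max": hi,
            "top_values": top,
        }
    return profile


def describe_profile(table, profile):
    """Render the profile as plain text that could be placed in an LLM prompt."""
    lines = [f"Table {table}:"]
    for name, stats in profile.items():
        values = ", ".join(repr(value) for value, _ in stats["top_values"])
        lines.append(
            f"  {name} ({stats['type']}): {stats['distinct']} distinct, "
            f"{stats['nulls']} NULL of {stats['rows']} rows, "
            f"range [{stats['min']!r}, {stats['max']!r}], common values: {values}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative data only; the table and its contents are made up.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "shipped", 10.0), (2, "shipped", 25.5), (3, "pending", None), (4, "shipped", 7.25)],
    )
    print(describe_profile("orders", profile_table(conn, "orders")))
```

A text summary of this kind could be added to the prompt alongside the schema, so that the model sees which values a column actually holds rather than only its name and declared type.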
Similar Papers
Meta-aware Learning in text-to-SQL Large Language Model
Artificial Intelligence
Helps computers understand business data better.
SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation
Artificial Intelligence
Lets computers understand any database question.
Text-to-SQL based on Large Language Models and Database Keyword Search
Databases
Helps computers understand messy questions for databases.