Bootstrapping Learned Cost Models with Synthetic SQL Queries
By: Michael Nidd, Christoph Miksovic, Thomas Gschwind and more
Potential Business Impact:
Helps database cost-prediction models become more accurate with less training data.
Access to realistic workloads for a given database instance is essential for stress and vulnerability testing, as well as for cost and performance optimization. Recent advances in learned cost models have shown that, when enough diverse SQL queries are available, one can effectively and efficiently predict the cost of running a given query against a specific database engine. In this paper, we describe our experience in exploiting modern synthetic data generation techniques, inspired by the generative AI and LLM community, to create high-quality datasets that enable the effective training of such learned cost models. Initial results show that we can improve a learned cost model's predictive accuracy while training it with 45% fewer queries than competing generation approaches require.
Similar Papers
Automated Training of Learned Database Components with Generative AI
Databases
Makes computer databases learn faster with fake data.
Beyond SELECT: A Comprehensive Taxonomy-Guided Benchmark for Real-World Text-to-SQL Translation
Computation and Language
Teaches computers to understand and write database questions.
Redefining Cost Estimation in Database Systems: The Role of Execution Plan Features and Machine Learning
Databases
Helps computers guess how long database tasks take.