Score: 1

Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL

Published: November 13, 2025 | arXiv ID: 2511.10192v2

By: Qifeng Cai , Hao Liang , Chang Xu and more

Potential Business Impact:

Makes AI better at understanding and writing database questions.

Business Areas:
Text Analytics Data and Analytics, Software

The data-centric paradigm has become pivotal in AI, especially for Text-to-SQL, where performance is limited by scarce, simplistic, and low-diversity datasets. To address this, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification. A modular Database Manager ensures cross-database compatibility and scalability. Using this framework, we build SQLFlow, a high-quality dataset of 89,544 annotated examples. We evaluate SQLFlow in two settings: (1) For open-source LLMs, fine-tuning on SQLFlow consistently improves performance across benchmarks under the same data budget. (2) For closed-source LLMs, we introduce a masked alignment retrieval method that treats SQLFlow as both knowledge base and training data for the retriever. This enables structure-aware example matching by modeling fine-grained alignments between questions and SQL queries. Experiments show our retrieval strategy outperforms existing methods, underscoring the value of SQLFlow's high-fidelity data and our novel technique. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.

Country of Origin
🇨🇳 China

Repos / Data Links

Page Count
14 pages

Category
Computer Science:
Computation and Language