GeoSQL-Eval: First Evaluation of LLMs on PostGIS-Based NL2GeoSQL Queries
By: Shuyang Hou, Haoyue Jiao, Ziqi Liu, and more
Potential Business Impact:
Helps computers understand map questions and find answers.
Large language models (LLMs) have shown strong performance on natural language to SQL (NL2SQL) tasks over general-purpose databases. Extending them to GeoSQL, however, introduces additional complexity from spatial data types, function invocation, and coordinate systems, which greatly increases both generation and execution difficulty. Existing benchmarks mainly target general SQL, and a systematic evaluation framework for GeoSQL is still lacking. To fill this gap, we present GeoSQL-Eval, the first end-to-end automated evaluation framework for PostGIS query generation, together with GeoSQL-Bench, a benchmark for assessing LLM performance on NL2GeoSQL tasks. GeoSQL-Bench defines three task categories (conceptual understanding, syntax-level SQL generation, and schema retrieval), comprising 14,178 instances, 340 PostGIS functions, and 82 thematic databases. GeoSQL-Eval is grounded in Webb's Depth of Knowledge (DOK) model, covering four cognitive dimensions, five capability levels, and twenty task types to establish a comprehensive evaluation process spanning knowledge acquisition, syntax generation, semantic alignment, execution accuracy, and robustness. We evaluate 24 representative models across six categories and apply the entropy weight method together with statistical analyses to uncover performance differences, common error patterns, and resource-usage characteristics. Finally, we release a public GeoSQL-Eval leaderboard platform for continuous testing and global comparison. This work extends the NL2GeoSQL paradigm and provides a standardized, interpretable, and extensible framework for evaluating LLMs in spatial database contexts, offering a valuable reference for geospatial information science and related applications.
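To make the added difficulty concrete, below is a minimal, hypothetical NL2GeoSQL pair; it is a sketch, not an instance from GeoSQL-Bench. The parks/rivers schema, column names, and connection string are assumptions, but the query shows how two of the complications the abstract names (PostGIS function invocation via ST_DWithin, and coordinate-system handling via ST_Transform) enter even a simple spatial question that a general NL2SQL model would never face.

```python
# Hypothetical NL2GeoSQL pair (schema and names are assumptions, not from
# GeoSQL-Bench). The SQL must invoke spatial functions and reproject
# geometries before measuring distance -- difficulties absent in general SQL.
import psycopg2  # PostgreSQL driver; the PostGIS functions run server-side

QUESTION = "Which parks lie within 500 m of the river named 'Elbe'?"

SQL = """
SELECT p.name
FROM parks AS p
JOIN rivers AS r ON r.name = %s
WHERE ST_DWithin(
    ST_Transform(p.geom, 3857),  -- reproject lon/lat to a metric SRID;
    ST_Transform(r.geom, 3857),  -- EPSG:3857 metres are approximate, a local
    500                          -- projected SRID or geography would be exact
);
"""

def run(dsn: str) -> list[str]:
    # dsn is a placeholder connection string, e.g. "dbname=gis user=gis"
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(SQL, ("Elbe",))
        return [row[0] for row in cur.fetchall()]
```

An LLM that forgets the reprojection step produces a query that still executes but compares degrees to metres, which is why execution success alone is a weak signal and semantic alignment is evaluated separately.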
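The abstract also mentions aggregating scores with the entropy weight method. The sketch below shows one standard formulation of that method, assuming a non-negative score matrix of shape (models x metrics) where higher is better; the actual metric set and normalization used by GeoSQL-Eval are not specified here, so treat this only as an illustration of the technique.

```python
# Sketch of the entropy weight method (standard formulation, not necessarily
# the exact variant used by GeoSQL-Eval). Metrics on which models differ more
# carry more information and therefore receive larger weights.
import numpy as np

def entropy_weights(scores: np.ndarray) -> np.ndarray:
    """scores: (n_models, n_metrics), non-negative, higher is better."""
    m = scores.shape[0]
    # Column-wise proportions: each model's share of a metric's total.
    p = scores / (scores.sum(axis=0) + 1e-12)
    # Shannon entropy per metric, scaled by ln(m) so e_j lies in [0, 1].
    e = -(p * np.log(p + 1e-12)).sum(axis=0) / np.log(m)
    # Low entropy = high dispersion across models = more discriminative metric.
    d = 1.0 - e
    return d / d.sum()

# Example: 3 models scored on 4 hypothetical metrics.
demo = np.array([[0.9, 0.4, 0.7, 0.2],
                 [0.6, 0.5, 0.7, 0.9],
                 [0.3, 0.6, 0.7, 0.4]])
print(entropy_weights(demo))  # 3rd metric is identical across models -> weight 0
```

The weighted sum of normalized metric scores then yields a single leaderboard score per model, which the statistical analyses in the paper break down by task type and error pattern.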
Similar Papers
GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks
Computation and Language
Helps computers understand maps and locations better.
Evaluating NL2SQL via SQL2NL
Computation and Language
Makes AI better at understanding different ways of asking questions.
GeoAnalystBench: A GeoAI benchmark for assessing large language models for spatial analysis workflow and code generation
Software Engineering
Tests AI that analyzes maps and locations.