Score: 2

Quality Assessment of Tabular Data using Large Language Models and Code Generation

Published: September 11, 2025 | arXiv ID: 2509.10572v2

By: Ashlesha Akella, Akshar Kaul, Krishnasuri Narayanam and more

BigTech Affiliations: IBM

Potential Business Impact:

Fixes messy data automatically for better computer use.

Business Areas:
Text Analytics, Data and Analytics, Software

Reliable data quality is crucial for downstream analysis of tabular datasets, yet rule-based validation is often inefficient, dependent on human intervention, and computationally expensive. We present a three-stage framework that combines statistical inlier detection with LLM-driven rule and code generation. After filtering data samples through traditional clustering, we iteratively prompt LLMs to produce semantically valid quality rules and then synthesize executable validators for those rules with code-generating LLMs. To generate reliable quality rules, we support the LLMs with retrieval-augmented generation (RAG), drawing on external knowledge sources and domain-specific few-shot examples. Robust guardrails ensure the accuracy and consistency of both rules and code snippets. Extensive evaluations on benchmark datasets confirm the effectiveness of our approach.
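The three stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the clustering step is swapped for a simple median-absolute-deviation filter, the two LLM calls are replaced by hypothetical stub functions (`fake_rule_llm`, `fake_code_llm`), RAG is omitted, and the guardrail shown (a syntax check plus a pass-rate check on inlier rows) is only one plausible form such a check could take. All names here are invented for the example.

```python
import ast
import statistics
from typing import Callable

Row = dict[str, float]


def filter_inliers(rows: list[Row], column: str, max_dev: float = 3.5) -> list[Row]:
    """Stage 1 stand-in: keep rows whose modified z-score is within max_dev.

    Assumption: the paper filters with clustering; a MAD filter is used here
    only to keep the sketch dependency-free.
    """
    values = [r[column] for r in rows]
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return list(rows)  # no spread to measure, so keep everything
    return [r for r in rows if 0.6745 * abs(r[column] - med) / mad <= max_dev]


def fake_rule_llm(sample: list[Row]) -> list[str]:
    """Stage 2 stand-in (hypothetical): a real system would prompt an LLM."""
    return ["age must be between 0 and 120"]


def fake_code_llm(rules: list[str]) -> str:
    """Stage 3 stand-in (hypothetical): returns validator source for the rules."""
    return "def validate(row):\n    return 0 <= row['age'] <= 120\n"


def compile_validator(source: str, sample: list[Row],
                      min_pass_rate: float = 0.9) -> Callable[[Row], bool]:
    """Guardrail sketch: reject code that does not parse, does not define
    exactly one function named `validate`, or fails too many inlier rows."""
    tree = ast.parse(source)  # raises SyntaxError on malformed code
    names = [n.name for n in tree.body if isinstance(n, ast.FunctionDef)]
    if names != ["validate"]:
        raise ValueError("generated code must define exactly one `validate`")
    namespace: dict = {}
    # A real system would sandbox this call; exec is used only for illustration.
    exec(compile(tree, "<generated>", "exec"), namespace)
    validate = namespace["validate"]
    passed = sum(1 for r in sample if validate(r))
    if passed / len(sample) < min_pass_rate:
        raise ValueError("validator rejects too many inlier rows")
    return validate


def build_validator(rows: list[Row], column: str,
                    rule_llm: Callable[[list[Row]], list[str]],
                    code_llm: Callable[[list[str]], str]) -> Callable[[Row], bool]:
    """Chain the stages: filter, generate rules, generate code, then check it."""
    inliers = filter_inliers(rows, column)
    rules = rule_llm(inliers)
    source = code_llm(rules)
    return compile_validator(source, inliers)


# Toy data: one clearly implausible age among otherwise ordinary values.
rows = [{"age": a} for a in (23, 31, 45, 52, 38, 29, 410)]
validator = build_validator(rows, "age", fake_rule_llm, fake_code_llm)
```

In this sketch the validator is checked only against rows that survived filtering, so an anomalous value in the raw data cannot cause a sensible rule to be rejected. Whether the paper's guardrails work this way is not stated in the abstract.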

Country of Origin
🇺🇸 United States


Page Count
31 pages

Category
Computer Science:
Software Engineering