Quality Assessment of Tabular Data using Large Language Models and Code Generation
By: Ashlesha Akella, Akshar Kaul, Krishnasuri Narayanam, and more
Potential Business Impact:
Automatically validates and cleans messy tabular data for more reliable analysis.
Reliable data quality is crucial for downstream analysis of tabular datasets, yet rule-based validation is often inefficient, requires human intervention, and incurs high computational costs. We present a three-stage framework that combines statistical inlier detection with LLM-driven rule and code generation. After filtering data samples through traditional clustering, we iteratively prompt LLMs to produce semantically valid quality rules and synthesize their executable validators with code-generating LLMs. To generate reliable quality rules, we augment the LLMs with retrieval-augmented generation (RAG), leveraging external knowledge sources and domain-specific few-shot examples. Robust guardrails ensure the accuracy and consistency of both rules and code snippets. Extensive evaluations on benchmark datasets confirm the effectiveness of our approach.
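The pipeline above can be illustrated with a minimal sketch. The paper filters inliers via clustering and has LLMs emit both rules and validator code; here a simple z-score filter stands in for the clustering stage, and the rule text and generated validator are hard-coded stand-ins for LLM output. The guardrail step (compiling and smoke-testing generated code before use) is a hypothetical illustration, not the authors' implementation.

```python
import statistics

# Stage 1 (sketch): statistical inlier filtering. The paper uses traditional
# clustering; a z-score cutoff is a simpler stand-in for illustration.
def filter_inliers(values, z_threshold=1.5):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values) or 1.0
    return [v for v in values if abs(v - mean) / stdev <= z_threshold]

# Stage 2 (sketch): an LLM, aided by RAG, would propose a quality rule in
# natural language. This rule text is a hypothetical example.
RULE = "age must be an integer between 0 and 120"

# Stage 3 (sketch): a code-generating LLM would synthesize an executable
# validator for the rule. This source string is a hand-written stand-in.
GENERATED_VALIDATOR = """
def validate(row):
    age = row.get("age")
    return isinstance(age, int) and 0 <= age <= 120
"""

def build_validator(source):
    # Guardrail: compile and smoke-test generated code before trusting it.
    namespace = {}
    exec(source, namespace)
    validator = namespace["validate"]
    assert validator({"age": 30}) is True  # sanity check on a known-good row
    return validator

validator = build_validator(GENERATED_VALIDATOR)
rows = [{"age": 30}, {"age": -5}, {"age": "n/a"}]
print([validator(r) for r in rows])  # → [True, False, False]
print(filter_inliers([10, 11, 12, 13, 200]))  # → [10, 11, 12, 13]
```

In the paper's framework, the rule and validator would be produced iteratively by prompting LLMs, with the guardrails rejecting rules or code that fail consistency checks rather than a single hard-coded assertion.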
Similar Papers
Table Comprehension in Building Codes using Vision Language Models and Domain-Specific Fine-Tuning
Computation and Language
Helps computers understand building rules from pictures.
Agentic LLMs for Question Answering over Tabular Data
Computation and Language
Answers questions from complex tables using smart computer language.