Quality Assessment of Tabular Data using Large Language Models and Code Generation
By: Ashlesha Akella, Akshar Kaul, Krishnasuri Narayanam, et al.
Potential Business Impact:
Automatically detects and repairs quality issues in tabular data, reducing the manual validation effort needed before downstream analysis.
Reliable data quality is crucial for downstream analysis of tabular datasets, yet rule-based validation often suffers from inefficiency, heavy human intervention, and high computational cost. We present a three-stage framework that combines statistical inlier detection with LLM-driven rule and code generation. After filtering data samples through traditional clustering, we iteratively prompt LLMs to produce semantically valid quality rules and synthesize executable validators for them with code-generating LLMs. To generate reliable quality rules, we augment the LLMs with retrieval-augmented generation (RAG), leveraging external knowledge sources and domain-specific few-shot examples. Robust guardrails ensure the accuracy and consistency of both the rules and the code snippets. Extensive evaluations on benchmark datasets confirm the effectiveness of our approach.
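The pipeline stages named in the abstract (statistical inlier filtering, then guardrailed acceptance of LLM-generated validator code) can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: the function names, the MAD-based z-score threshold, and the example quality rule are all assumptions.

```python
from statistics import median

def detect_inliers(values, z_thresh=3.5):
    """Stage-1 sketch: flag inliers with a robust (median/MAD) z-score."""
    med = median(values)
    devs = [abs(v - med) for v in values]
    mad = median(devs) or 1.0  # avoid division by zero on constant columns
    return [0.6745 * d / mad <= z_thresh for d in devs]

def guardrail_check(rule_src, sample_rows):
    """Guardrail sketch: accept a generated validator only if it compiles
    and returns a boolean on every sample row; otherwise reject it."""
    try:
        fn = eval(compile(rule_src, "<rule>", "eval"))
    except SyntaxError:
        return None
    try:
        if all(isinstance(fn(row), bool) for row in sample_rows):
            return fn
    except Exception:
        return None
    return None

column = [10.1, 9.8, 10.3, 10.0, 250.0]  # 250.0 is a gross outlier
mask = detect_inliers(column)
inliers = [v for v, keep in zip(column, mask) if keep]

# A hypothetical LLM-generated quality rule, vetted before use.
rule = "lambda row: row['age'] >= 0"
validator = guardrail_check(rule, [{"age": 5}, {"age": 0}])
```

In this sketch, `mask` keeps the first four values and rejects `250.0`, and the generated rule passes the guardrail because it compiles and returns booleans on the sample rows; a syntactically invalid rule would be rejected with `None`.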