A Note on Statistically Accurate Tabular Data Generation Using Large Language Models
By: Andrey Sidorenko
Potential Business Impact:
Makes synthetic (fake) data statistically more like real data.
Large language models (LLMs) have shown promise in synthetic tabular data generation, yet existing methods struggle to preserve complex feature dependencies, particularly among categorical variables. This work introduces a probability-driven prompting approach that leverages LLMs to estimate conditional distributions, enabling more accurate and scalable data synthesis. The results highlight the potential of prompting for probability distributions to enhance the statistical fidelity of LLM-generated tabular data.
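To make the idea concrete, the sketch below illustrates one plausible reading of probability-driven prompting: the LLM is asked for a conditional probability distribution over a categorical column given already-known feature values, and the next value is then sampled from that distribution rather than generated token by token. This is a minimal illustration, not the paper's actual implementation; the `query_llm` helper, the prompt wording, and the column names are assumptions, and the stubbed response stands in for a real LLM API call.

```python
import json
import random

# Hypothetical placeholder for a real LLM call (e.g. via an API client).
# It returns a fixed example response here so the sketch runs end to end.
def query_llm(prompt: str) -> str:
    return '{"low": 0.2, "medium": 0.5, "high": 0.3}'

def sample_categorical(column: str, categories: list[str], conditions: dict) -> str:
    """Ask the LLM for a conditional distribution over `categories`
    given already-sampled feature values, then sample from it."""
    cond_text = ", ".join(f"{k} = {v}" for k, v in conditions.items())
    prompt = (
        f"Given a record where {cond_text}, estimate the probability of each "
        f"value of '{column}' from {categories}. "
        "Answer with a JSON object mapping each value to a probability."
    )
    probs = json.loads(query_llm(prompt))

    # Keep only known categories and renormalize, since LLM outputs may
    # omit values or not sum exactly to one.
    weights = [max(float(probs.get(c, 0.0)), 0.0) for c in categories]
    total = sum(weights)
    if total == 0:
        weights = [1.0] * len(categories)  # fall back to a uniform distribution
        total = float(len(categories))
    weights = [w / total for w in weights]

    return random.choices(categories, weights=weights, k=1)[0]

# Example: sample 'income_bracket' conditioned on previously generated features.
row = {"age": 42, "education": "bachelor"}
row["income_bracket"] = sample_categorical("income_bracket", ["low", "medium", "high"], row)
print(row)
```

Sampling from an explicitly returned distribution, rather than letting the model free-generate a value, is what allows the synthesized column to track the estimated conditional probabilities.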
Similar Papers
FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs
Machine Learning (CS)
Generates fake data faster and at lower cost.
SampleLLM: Optimizing Tabular Data Synthesis in Recommendations
Information Retrieval
Improves recommendation systems using fake data.
SynLLM: A Comparative Analysis of Large Language Models for Medical Tabular Synthetic Data Generation via Prompt Engineering
Artificial Intelligence
Creates fake patient data for medical research.