Dependency-aware synthetic tabular data generation
By: Chaithra Umesh, Kristian Schultz, Manjunath Mahendra and more
Potential Business Impact:
Makes fake health data keep real health rules.
Synthetic tabular data is increasingly used in privacy-sensitive domains such as health care, but existing generative models often fail to preserve inter-attribute relationships. In particular, functional dependencies (FDs) and logical dependencies (LDs), which capture deterministic and rule-based associations between features, are rarely retained in synthetic datasets, and often only poorly when they are. To address this research gap, we propose the Hierarchical Feature Generation Framework (HFGF) for synthetic tabular data generation. The framework first generates the independent features using any standard generative model, and then reconstructs the dependent features from predefined FD and LD rules. To evaluate HFGF, we created benchmark datasets with known dependencies. Our experiments on four benchmark datasets with varying sizes, feature imbalance, and dependency complexity demonstrate that HFGF improves the preservation of FDs and LDs across six generative models, including CTGAN, TVAE, and GReaT. Our findings demonstrate that HFGF can significantly enhance the structural fidelity and downstream utility of synthetic tabular data.
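The two-stage idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy table, the column names, the helper functions `learn_fd`, `learn_ld`, and `generate`, and the way rules are represented (an FD as a value lookup, an LD as a set of allowed values) are all assumptions made for illustration. A simple per-column resampler stands in for a real generative model such as CTGAN or TVAE.

```python
import random
from collections import Counter, defaultdict

# Toy "real" table (made-up values). Assumed rules for this sketch:
#   FD: zip -> city        (each zip determines exactly one city)
#   LD: sex -> pregnant    (sex restricts which values are allowed)
REAL = [
    {"age": 34, "zip": "18055", "sex": "F", "city": "Rostock", "pregnant": "yes"},
    {"age": 51, "zip": "18055", "sex": "M", "city": "Rostock", "pregnant": "no"},
    {"age": 27, "zip": "10115", "sex": "F", "city": "Berlin", "pregnant": "no"},
    {"age": 63, "zip": "10115", "sex": "M", "city": "Berlin", "pregnant": "no"},
    {"age": 45, "zip": "20095", "sex": "F", "city": "Hamburg", "pregnant": "no"},
]

INDEPENDENT = ["age", "zip", "sex"]


def learn_fd(rows, determinant, dependent):
    """FD as a lookup: determinant value -> most frequent dependent value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[determinant]][row[dependent]] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def learn_ld(rows, determinant, dependent):
    """LD as allowed values: determinant value -> sorted list of observed values."""
    allowed = defaultdict(set)
    for row in rows:
        allowed[row[determinant]].add(row[dependent])
    return {key: sorted(values) for key, values in allowed.items()}


def generate(real_rows, n, seed=0):
    """Stage 1: sample independent features. Stage 2: rebuild dependent ones."""
    rng = random.Random(seed)
    fd_city = learn_fd(real_rows, "zip", "city")
    ld_pregnant = learn_ld(real_rows, "sex", "pregnant")

    synthetic = []
    for _ in range(n):
        # Stand-in for a generative model: resample each column on its own.
        row = {col: rng.choice(real_rows)[col] for col in INDEPENDENT}
        # Dependent features are reconstructed from the rules, never generated.
        row["city"] = fd_city[row["zip"]]
        row["pregnant"] = rng.choice(ld_pregnant[row["sex"]])
        synthetic.append(row)
    return synthetic


synthetic_rows = generate(REAL, n=50, seed=0)
```

Because `city` and `pregnant` are filled in from the rules rather than sampled by the model, every synthetic row satisfies the two dependencies by construction, whatever the quality of the stage-one generator. How HFGF actually encodes and applies its FD and LD rules is not specified in the abstract.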
Similar Papers
Reinforcement Learning-based Feature Generation Algorithm for Scientific Data
Machine Learning (CS)
Automates making data smarter for better science.
Assessing Generative Models for Structured Data
Machine Learning (CS)
Makes fake data that looks like real data.
Synthetic Survival Data Generation for Heart Failure Prognosis Using Deep Generative Models
Machine Learning (CS)
Creates fake patient data for heart research.