Generating Synthetic Data with Formal Privacy Guarantees: State of the Art and the Road Ahead
By: Viktor Schlegel, Anil A Bharath, Zilong Zhao, and more
Potential Business Impact:
Creates fake data that keeps real secrets safe.
Privacy-preserving synthetic data offers a promising solution to harness segregated data in high-stakes domains where information is compartmentalized for regulatory, privacy, or institutional reasons. This survey provides a comprehensive framework for understanding the landscape of privacy-preserving synthetic data, presenting the theoretical foundations of generative models and differential privacy, followed by a review of state-of-the-art methods across tabular data, images, and text. Our synthesis of evaluation approaches highlights the fundamental trade-off between utility for downstream tasks and privacy guarantees, while identifying critical research gaps: the lack of realistic benchmarks representing specialized domains and insufficient empirical evaluations required to contextualise formal guarantees. Through empirical analysis of four leading methods on five real-world datasets from specialized domains, we demonstrate significant performance degradation under realistic privacy constraints ($\epsilon \leq 4$), revealing a substantial gap between results reported on general-domain benchmarks and performance on domain-specific data. Our findings highlight key challenges, including unaccounted privacy leakage, insufficient empirical verification of formal guarantees, and a critical deficit of realistic benchmarks. These challenges underscore the need for robust evaluation frameworks, standardized benchmarks for specialized domains, and improved techniques to address the unique requirements of privacy-sensitive fields, so that this technology can deliver on its considerable potential.
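For readers unfamiliar with the formal guarantee the abstract refers to, the following is the standard definition of $(\epsilon, \delta)$-differential privacy, given here as general background rather than a formula taken from the paper itself. A randomized mechanism $M$ satisfies $(\epsilon, \delta)$-differential privacy if, for every pair of datasets $D$ and $D'$ differing in a single record and every measurable set of outputs $S$,
$$\Pr[M(D) \in S] \;\leq\; e^{\epsilon}\,\Pr[M(D') \in S] + \delta.$$
Smaller values of $\epsilon$ correspond to stronger guarantees, so the $\epsilon \leq 4$ regime used in the paper's empirical analysis represents a comparatively strict privacy budget.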
Similar Papers
Evaluating Differentially Private Generation of Domain-Specific Text
Machine Learning (CS)
Creates fake data that keeps real secrets safe.
Privacy-Preserving Fair Synthetic Tabular Data
Machine Learning (CS)
Creates private, fair data for sharing without bias.
Optimizing the Privacy-Utility Balance using Synthetic Data and Configurable Perturbation Pipelines
Cryptography and Security
Makes private data safe for computer learning.