Score: 0

SMOTE-DP: Improving Privacy-Utility Tradeoff with Synthetic Data

Published: June 2, 2025 | arXiv ID: 2506.01907v1

By: Yan Zhou, Bradley Malin, Murat Kantarcioglu

Potential Business Impact:

Makes private data useful without losing secrets.

Business Areas:
Predictive Analytics Artificial Intelligence, Data and Analytics, Software

Privacy-preserving data publication, including synthetic data sharing, often experiences trade-offs between privacy and utility. Synthetic data is generally more effective than data anonymization in balancing this trade-off, however, not without its own challenges. Synthetic data produced by generative models trained on source data may inadvertently reveal information about outliers. Techniques specifically designed for preserving privacy, such as introducing noise to satisfy differential privacy, often incur unpredictable and significant losses in utility. In this work we show that, with the right mechanism of synthetic data generation, we can achieve strong privacy protection without significant utility loss. Synthetic data generators producing contracting data patterns, such as Synthetic Minority Over-sampling Technique (SMOTE), can enhance a differentially private data generator, leveraging the strengths of both. We prove in theory and through empirical demonstration that this SMOTE-DP technique can produce synthetic data that not only ensures robust privacy protection but maintains utility in downstream learning tasks.

Country of Origin
🇺🇸 United States

Page Count
16 pages

Category
Computer Science:
Machine Learning (CS)