A Discrepancy-Based Perspective on Dataset Condensation
By: Tong Chen, Raghavendra Selvan
Potential Business Impact:
Makes small fake data teach computers like big real data.
Given a dataset of finitely many elements $\mathcal{T} = \{\mathbf{x}_i\}_{i = 1}^N$, the goal of dataset condensation (DC) is to construct a synthetic dataset $\mathcal{S} = \{\tilde{\mathbf{x}}_j\}_{j = 1}^M$ which is significantly smaller ($M \ll N$) such that a model trained from scratch on $\mathcal{S}$ achieves comparable or even superior generalization performance to a model trained on $\mathcal{T}$. Recent advances in DC reveal a close connection to the problem of approximating the data distribution represented by $\mathcal{T}$ with a reduced set of points. In this work, we present a unified framework that encompasses existing DC methods and extend the task-specific notion of DC to a more general and formal definition using notions of discrepancy, which quantify the distance between probability distributions in different regimes. Our framework broadens the objective of DC beyond generalization, accommodating additional objectives such as robustness, privacy, and other desirable properties.
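As a concrete illustration of this discrepancy-based view, the sketch below condenses a dataset by directly minimizing one common discrepancy measure, the RBF-kernel maximum mean discrepancy (MMD), between the synthetic set $\mathcal{S}$ and the real set $\mathcal{T}$. The choice of MMD, the function names, and the hyperparameters are illustrative assumptions, not the paper's specific framework, which admits other discrepancies and objectives.

```python
# Minimal sketch of discrepancy-based dataset condensation, assuming the
# discrepancy is instantiated as a (biased) RBF-kernel maximum mean
# discrepancy (MMD). Hyperparameters and names are illustrative.
import torch


def rbf_kernel(a, b, sigma=1.0):
    # Pairwise RBF kernel values k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2)).
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-d2 / (2 * sigma ** 2))


def mmd2(x, y, sigma=1.0):
    # Biased estimate of the squared MMD between the empirical
    # distributions of x and y.
    return (rbf_kernel(x, x, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean()
            + rbf_kernel(y, y, sigma).mean())


def condense(real, m=10, steps=500, lr=0.1, sigma=1.0):
    # Initialize M synthetic points from a random subset of the real data,
    # then minimize the discrepancy between the synthetic and real sets.
    idx = torch.randperm(real.shape[0])[:m]
    synth = real[idx].clone().requires_grad_(True)
    opt = torch.optim.Adam([synth], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = mmd2(synth, real, sigma)
        loss.backward()
        opt.step()
    return synth.detach()


if __name__ == "__main__":
    torch.manual_seed(0)
    real_data = torch.randn(1000, 32)       # stand-in for the real dataset T
    synthetic = condense(real_data, m=10)   # condensed dataset S with M = 10
    print(mmd2(synthetic, real_data).item())
```

In this toy setup the synthetic points are optimized directly in input space; practical DC methods typically match distributions of features, gradients, or training trajectories rather than raw inputs, but the objective keeps the same form: minimize a chosen discrepancy between $\mathcal{S}$ and $\mathcal{T}$.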
Similar Papers
Improving Clinical Dataset Condensation with Mode Connectivity-based Trajectory Surrogates
Machine Learning (CS)
Makes medical data usable without real patient info.
Utility Boundary of Dataset Distillation: Scaling and Configuration-Coverage Laws
Machine Learning (CS)
Makes AI learn from less data, faster.
Dataset Condensation with Color Compensation
CV and Pattern Recognition
Makes computer learning better by fixing image colors.