Trade-offs in Data Memorization via Strong Data Processing Inequalities
By: Vitaly Feldman, Guy Kornowski, Xin Lyu
Potential Business Impact:
Shows how much private training data computers must memorize when they learn.
Recent research demonstrated that training large language models involves memorization of a significant fraction of the training data. Such memorization can lead to privacy violations when training on sensitive user data, and thus motivates the study of data memorization's role in learning. In this work, we develop a general approach for proving lower bounds on excess data memorization that relies on a new connection between strong data processing inequalities and data memorization. We then demonstrate that several simple and natural binary classification problems exhibit a trade-off between the number of samples available to a learning algorithm and the amount of information about the training data that the algorithm needs to memorize in order to be accurate. In particular, $\Omega(d)$ bits of information about the training data need to be memorized when $O(1)$ $d$-dimensional examples are available; this requirement then decays, as the number of examples grows, at a problem-specific rate. Further, our lower bounds are generally matched (up to logarithmic factors) by simple learning algorithms. We also extend our lower bounds to more general mixture-of-clusters models. Our definitions and results build on the work of Brown et al. (2021) and address several limitations of the lower bounds in their work.
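As a rough illustration of the stated trade-off (not the paper's exact theorem), one can write the lower bound in the following schematic form, where $\mathrm{mem}_n(\mathcal{A})$ denotes the excess memorization of an accurate learner $\mathcal{A}$ trained on $n$ samples and $f(n)$ is a placeholder for the problem-specific decay rate:

$$
\mathrm{mem}_n(\mathcal{A}) \;=\; \Omega\!\left(\frac{d}{f(n)}\right), \qquad f(n) = \Theta(1) \ \text{when } n = O(1),
$$

so that with only constantly many $d$-dimensional examples any accurate learner must memorize $\Omega(d)$ bits, and the required memorization shrinks as $f(n)$ grows with the sample size.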
Similar Papers
Trustworthy Machine Learning via Memorization and the Granular Long-Tail: A Survey on Interactions, Tradeoffs, and Beyond
Machine Learning (CS)
Teaches computers to remember good and bad data.
Beyond Frequency: The Role of Redundancy in Large Language Model Memorization
Machine Learning (CS)
Makes AI forget private stuff, not important facts.
Assessing and Mitigating Data Memorization Risks in Fine-Tuned Large Language Models
Computation and Language
Keeps private info safe when computers learn.