Concentration and excess risk bounds for imbalanced classification with synthetic oversampling
By: Touqeer Ahmad, Mohammadreza M. Kalan, François Portier, and more
Potential Business Impact:
Helps classifiers learn more reliably from imbalanced data.
Synthetic oversampling of minority examples using SMOTE and its variants is a leading strategy for addressing imbalanced classification problems. Despite the success of this approach in practice, its theoretical foundations remain underexplored. We develop a theoretical framework to analyze the behavior of SMOTE and related methods when classifiers are trained on synthetic data. We first derive a uniform concentration bound on the discrepancy between the empirical risk over synthetic minority samples and the population risk on the true minority distribution. We then provide a nonparametric excess risk guarantee for kernel-based classifiers trained using such synthetic data. These results lead to practical guidelines for better parameter tuning of both SMOTE and the downstream learning algorithm. Numerical experiments are provided to illustrate and support the theoretical findings.
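To make the setup concrete, below is a minimal sketch (not the authors' code or experimental setup) of the pipeline the abstract analyzes: SMOTE oversampling of the minority class followed by training a kernel-based classifier on the synthetic-augmented sample. The dataset, the SMOTE `k_neighbors` value, and the RBF bandwidth `gamma` are illustrative choices; it assumes scikit-learn and imbalanced-learn are available.

```python
# Hedged sketch: SMOTE oversampling + kernel classifier on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score
from imblearn.over_sampling import SMOTE

# Imbalanced binary problem: roughly 5% minority class (illustrative data).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Oversample the minority class with SMOTE; k_neighbors is one of the SMOTE
# parameters whose tuning the paper's guarantees are meant to inform.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X_train, y_train)

# Kernel (RBF) classifier trained on the synthetic-augmented sample; the
# bandwidth gamma is the kind of downstream parameter the analysis covers.
clf = SVC(kernel="rbf", gamma=0.1).fit(X_res, y_res)

print("balanced accuracy:", balanced_accuracy_score(y_test, clf.predict(X_test)))
```

The paper's bounds concern how well the empirical risk computed on synthetic samples like `X_res` tracks the true minority-class population risk, which is what justifies tuning both the oversampler and the kernel classifier jointly.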
Similar Papers
SMOTE and Mirrors: Exposing Privacy Leakage from Synthetic Minority Oversampling
Cryptography and Security
Shows how synthetic minority oversampling can leak private information.
Bias-Corrected Data Synthesis for Imbalanced Learning
Machine Learning (Stat)
Corrects bias in synthetic data so rare classes are learned better.
Large Language Models for Imbalanced Classification: Diversity makes the difference
Machine Learning (CS)
Makes computer learning better with more varied examples.