Understanding the Gain from Data Filtering in Multimodal Contrastive Learning

Published: December 16, 2025 | arXiv ID: 2512.14230v1

By: Divyansh Pareek, Sewoong Oh, Simon S. Du

The success of modern multimodal representation learning relies on internet-scale datasets. Due to the low quality of a large fraction of raw web data, data curation has become a critical step in the training pipeline. Filtering using a trained model (i.e., teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting $\eta \in (0,1]$ as the fraction of data with correctly matched modalities among $n$ paired samples, we utilize a linear contrastive learning setup to show a provable benefit of data filtering: (i) the error without filtering is upper and lower bounded by $\frac{1}{\eta\sqrt{n}}$, and (ii) the error with teacher-based filtering is upper bounded by $\frac{1}{\sqrt{\eta n}}$ in the large $\eta$ regime, and by $\frac{1}{\sqrt{n}}$ in the small $\eta$ regime.
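To give a feel for the size of the gap between these rates, the following is a minimal sketch that evaluates the three expressions from the abstract for a few values of the clean fraction η. It rests on assumptions not stated in the abstract: all constants and logarithmic factors are set to 1, the function names are hypothetical, and the values of η and n are arbitrary illustrations. The abstract does not say where the large-η regime ends and the small-η regime begins, so the sketch reports both filtered rates rather than choosing between them.

```python
import math


def unfiltered_rate(eta: float, n: int) -> float:
    """Rate 1 / (eta * sqrt(n)) quoted for training without filtering (constants dropped)."""
    return 1.0 / (eta * math.sqrt(n))


def filtered_rate_large_eta(eta: float, n: int) -> float:
    """Rate 1 / sqrt(eta * n) quoted for teacher-based filtering in the large-eta regime."""
    return 1.0 / math.sqrt(eta * n)


def filtered_rate_small_eta(n: int) -> float:
    """Rate 1 / sqrt(n) quoted for teacher-based filtering in the small-eta regime."""
    return 1.0 / math.sqrt(n)


if __name__ == "__main__":
    n = 1_000_000  # illustrative number of paired samples (assumption)
    for eta in (0.9, 0.5, 0.1):  # illustrative clean fractions (assumption)
        base = unfiltered_rate(eta, n)
        large = filtered_rate_large_eta(eta, n)
        small = filtered_rate_small_eta(n)
        print(
            f"eta={eta:.1f}  unfiltered={base:.2e}  "
            f"filtered(large-eta)={large:.2e}  filtered(small-eta)={small:.2e}  "
            f"ratio vs large-eta={base / large:.2f}  ratio vs small-eta={base / small:.2f}"
        )
```

With constants ignored, the ratio of the unfiltered rate to the large-η filtered rate simplifies to $1/\sqrt{\eta}$, and the ratio to the small-η filtered rate simplifies to $1/\eta$, so the stated advantage of filtering grows as the clean fraction shrinks.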

Train a Unified Multimodal Data Quality Classifier with Synthetic Data

CV and Pattern Recognition

Makes AI understand pictures and words better.

16 Oct 2025 3

87%

Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Machine Learning (CS)

Keeps AI from learning dangerous secrets.

8 Aug 2025 2

86%

Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality

Computation and Language

Reusing data makes AI learn better and faster.

10 Mar 2025 2
