Scaling Language-Free Visual Representation Learning
By: David Fan, Shengbang Tong, Jiachen Zhu, and more
Potential Business Impact:
Vision AI trained without language can now match language-trained models.
Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is often attributed to the semantics introduced by language supervision, even though visual SSL and CLIP models are often trained on different data. In this work, we ask the question: "Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or due to differences in the training data?" We study this question by training both visual SSL and CLIP models on the same MetaCLIP data, and leveraging VQA as a diverse testbed for vision encoders. In this controlled setup, visual SSL models scale better than CLIP models in terms of data and model capacity, and visual SSL performance does not saturate even after scaling up to 7B parameters. Consequently, we observe that visual SSL methods achieve CLIP-level performance on a wide range of VQA and classic vision benchmarks. These findings demonstrate that pure visual SSL can match language-supervised visual pretraining at scale, opening new opportunities for vision-centric representation learning.
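To make the comparison concrete, here is a minimal sketch of the kind of setup the abstract describes: a language-free SSL vision encoder and a CLIP vision encoder each produce per-patch token features that could be fed (via a projection) into a multimodal LLM for VQA evaluation. The checkpoints below (DINOv2, OpenAI CLIP) are public stand-ins, not the MetaCLIP-trained models from the paper, and the image path is a placeholder; this is not the authors' code.

```python
# Sketch (assumptions noted above): extract frozen visual tokens from a
# language-free SSL encoder and a CLIP vision encoder for the same image.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPVisionModel

device = "cuda" if torch.cuda.is_available() else "cpu"
image = Image.open("example.jpg").convert("RGB")  # placeholder local image

# Language-free SSL encoder (public stand-in for the paper's visual SSL models).
ssl_processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
ssl_encoder = AutoModel.from_pretrained("facebook/dinov2-base").to(device).eval()

# Language-supervised CLIP vision encoder.
clip_processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()

with torch.no_grad():
    ssl_inputs = ssl_processor(images=image, return_tensors="pt").to(device)
    clip_inputs = clip_processor(images=image, return_tensors="pt").to(device)

    # Both encoders yield patch-token features; in a multimodal LLM these tokens
    # would be projected into the language model's embedding space before VQA.
    ssl_tokens = ssl_encoder(**ssl_inputs).last_hidden_state    # [1, tokens, dim]
    clip_tokens = clip_encoder(**clip_inputs).last_hidden_state  # [1, tokens, dim]

print("SSL feature shape: ", tuple(ssl_tokens.shape))
print("CLIP feature shape:", tuple(clip_tokens.shape))
```

Under this framing, the paper's controlled comparison amounts to holding the training data (MetaCLIP) and the downstream VQA pipeline fixed, and varying only whether the vision encoder was trained with or without language supervision.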
Similar Papers
Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models
Computation and Language
Helps computers understand many languages better.
Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation
CV and Pattern Recognition
Makes computers judge video quality better, faster.
Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection
CV and Pattern Recognition
Helps computers understand pictures better by focusing on important parts.