Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning
By: François Role, Sébastien Meyer, Victor Amblard
Potential Business Impact:
Makes computers better at understanding pictures and words together.
Vision-language models (VLMs) embed texts and images in a shared representation space. However, these models have been shown to suffer from a modality gap: a clear separation, within the embedding space, between the embeddings of one modality and those of the other. Although this misalignment is detrimental to downstream tasks such as multimodal retrieval, multimodal clustering, and zero-shot classification, no generic and practical methods have so far been proposed to assess it precisely, let alone reduce it. We therefore propose novel measures and effective techniques (spectral- and optimal transport-based methods) to achieve this goal. Extensive experiments conducted on several image-text datasets and models demonstrate their effectiveness and their beneficial effects on downstream tasks. Our code is available at the URL provided in the paper's abstract.
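To make the idea concrete, here is a minimal NumPy sketch of the general recipe the abstract describes: quantify the gap, then shrink it. The gap measure used below (distance between modality centroids of L2-normalized embeddings) and the two reducers (mean-centering and an SVD-based orthogonal Procrustes map) are generic stand-ins, not the paper's actual spectral- and optimal transport-based methods; all function names are hypothetical.

```python
import numpy as np


def modality_gap(img: np.ndarray, txt: np.ndarray) -> float:
    """Distance between modality centroids after L2-normalizing each row.
    This is one common gap measure, assumed here for illustration."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))


def center_modalities(img: np.ndarray, txt: np.ndarray):
    """Baseline reduction: translate both modalities onto their midpoint."""
    shift = (img.mean(axis=0) - txt.mean(axis=0)) / 2.0
    return img - shift, txt + shift


def procrustes_align(img: np.ndarray, txt: np.ndarray) -> np.ndarray:
    """SVD-based orthogonal Procrustes: rotate the centered image embeddings
    onto their paired text embeddings (a simple spectral-style alignment)."""
    img_c = img - img.mean(axis=0)
    txt_c = txt - txt.mean(axis=0)
    u, _, vt = np.linalg.svd(img_c.T @ txt_c)
    return img_c @ (u @ vt) + txt.mean(axis=0)


if __name__ == "__main__":
    # Synthetic paired embeddings: the "image" side is a rotated, offset
    # copy of the "text" side, mimicking a modality gap.
    rng = np.random.default_rng(0)
    txt = rng.normal(size=(512, 64))
    q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
    img = txt @ q + 0.5
    print("gap before:          ", modality_gap(img, txt))
    img_c, txt_c = center_modalities(img, txt)
    print("gap after centering: ", modality_gap(img_c, txt_c))
    print("gap after Procrustes:", modality_gap(procrustes_align(img, txt), txt))
```

Centering alone closes the centroid gap but leaves the rotation in place; the Procrustes map additionally aligns paired embeddings, which is why pairing information (when available) gives a stronger reduction.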
Similar Papers
Exploring Textual Semantics Diversity for Image Transmission in Semantic Communication Systems using Visual Language Model
CV and Pattern Recognition
Sends pictures better by describing them with words.
Bridging the Modality Gap by Similarity Standardization with Pseudo-Positive Samples
Computation and Language
Makes searching text and pictures together work better.
Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models
Computation and Language
Helps computers understand spoken words better.