From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model
By: Yeong-Joon Ju, Seong-Whan Lee
Potential Business Impact:
Helps computers understand pictures and words together.
Multimodal Large Language Models (MLLMs) have emerged as a promising solution for universal embedding tasks, yet adapting their generative nature for discriminative representation learning remains a significant challenge. The dominant paradigm of large-scale contrastive pre-training suffers from critical inefficiencies, including prohibitive computational costs and a failure to leverage the intrinsic instruction-following capabilities of MLLMs. To overcome these limitations, we propose an efficient framework for universal multimodal embeddings that bridges this gap through two synergistic components. First, our hierarchical embedding prompt template employs a two-level instruction architecture that forces the model to produce discriminative representations. Building on this strong foundation, our second component, self-aware hard negative sampling, redefines the fine-tuning process by leveraging the model's own understanding to efficiently mine challenging negatives while actively filtering out potential false negatives. Our comprehensive experiments show that our hierarchical prompt achieves zero-shot performance competitive with contrastively trained baselines and enhances the fine-tuning process, lifting a simple in-batch negative baseline by 4.8 points on the MMEB benchmark. We further boost performance with our self-aware hard negative sampling, achieving state-of-the-art results without contrastive pre-training. Our work presents an effective and efficient pathway to adapt MLLMs for universal embedding tasks, significantly reducing training time.
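A minimal sketch of the two ideas the abstract describes: a two-level prompt that states the embedding task and then wraps the concrete input, and hard-negative mining that ranks candidates by the model's own similarity scores while skipping candidates scoring almost as high as the annotated positive (treated as potential false negatives). The helper names (build_prompt, mine_hard_negatives) and the false_neg_margin value are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# --- Two-level prompt (illustrative; not the paper's exact template) ---
def build_prompt(task_instruction: str, content: str) -> str:
    """Level 1 states the embedding task; level 2 wraps the concrete input.
    The trailing cue asks the model to compress the input into a single
    token, whose hidden state can serve as the embedding."""
    return (
        f"Instruction: {task_instruction}\n"
        f"Input: {content}\n"
        f"Summarize the input above in one representative word:"
    )

# --- Self-aware hard negative sampling (sketch) ---
def mine_hard_negatives(query_emb, cand_embs, pos_idx,
                        num_negatives=8, false_neg_margin=0.95):
    """Select the hardest negatives by the model's own similarity scores,
    filtering out candidates whose similarity is nearly as high as the
    positive's (likely false negatives).

    query_emb: (d,) tensor; cand_embs: (N, d) tensor; pos_idx: index of the
    annotated positive in cand_embs. false_neg_margin is an assumed
    hyperparameter, not a value reported in the paper.
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), cand_embs, dim=-1)
    pos_sim = sims[pos_idx]

    ranked = torch.argsort(sims, descending=True)
    hard_negatives = []
    for idx in ranked.tolist():
        if idx == pos_idx:
            continue
        # Skip likely false negatives: scored too close to the positive.
        if sims[idx] >= false_neg_margin * pos_sim:
            continue
        hard_negatives.append(idx)
        if len(hard_negatives) == num_negatives:
            break
    return hard_negatives
```

The selected indices would then supply the negatives for a standard contrastive fine-tuning objective, in place of relying only on in-batch negatives.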
Similar Papers
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
CV and Pattern Recognition
Teaches computers to understand pictures and words better.
UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
CV and Pattern Recognition
Helps computers understand pictures and words better.
The Future of MLLM Prompting is Adaptive: A Comprehensive Experimental Evaluation of Prompt Engineering Methods for Robust Multimodal Performance
Artificial Intelligence
Teaches AI to understand pictures and words better.