UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning

Published: October 15, 2025 | arXiv ID: 2510.13515v1

By: Tiancheng Gu, Kaicheng Yang, Kaichen Zhang and more

Potential Business Impact:

Helps computers understand pictures and words better.

Business Areas:

Semantic Search, Internet Services

Universal multimodal embedding models are foundational to various tasks. Existing approaches typically employ in-batch negative mining by measuring the similarity of query-candidate pairs. However, these methods often struggle to capture subtle semantic differences among candidates and lack diversity in negative samples. Moreover, the embeddings exhibit limited discriminative ability in distinguishing false and hard negatives. In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning and present a novel Universal Multimodal Embedding (UniME-V2) model. Our approach first constructs a potential hard negative set through global retrieval. We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores. These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives. Furthermore, the semantic matching scores are used as soft labels to mitigate the rigid one-to-one mapping constraint. By aligning the similarity matrix with the soft semantic matching score matrix, the model learns semantic distinctions among candidates, significantly enhancing its discriminative capacity. To further improve performance, we propose UniME-V2-Reranker, a reranking model trained on our mined hard negatives through a joint pairwise and listwise optimization approach. We conduct comprehensive experiments on the MMEB benchmark and multiple retrieval tasks, demonstrating that our method achieves state-of-the-art performance on average across all tasks.
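The two uses of the judge scores described in the abstract (filtering hard negatives, then serving as soft labels) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the function names, the `false_negative_margin` cutoff rule, the temperature, and the choice of KL divergence as the alignment objective are all hypothetical stand-ins, and the paper's actual selection rule and loss may differ.

```python
import math


def softmax(row, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def select_hard_negatives(judge_scores, positive_index,
                          false_negative_margin=0.1, k=2):
    """Pick up to k hard negatives from one query's retrieved candidates.

    Assumed rule (illustrative only): a candidate whose judge score comes
    within `false_negative_margin` of the positive's score is treated as a
    likely false negative and dropped. The remaining candidates are ranked
    by judge score, highest first, and the top k are returned as indices.
    """
    cutoff = judge_scores[positive_index] - false_negative_margin
    kept = [
        (score, idx)
        for idx, score in enumerate(judge_scores)
        if idx != positive_index and score < cutoff
    ]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in kept[:k]]


def soft_alignment_loss(similarities, judge_scores, temperature=1.0):
    """KL(target || predicted) for a single query.

    `target` is the softmax of the judge's semantic matching scores and
    `predicted` is the softmax of the embedding similarities. KL divergence
    is an assumption made for this sketch, not the paper's stated loss.
    """
    target = softmax(judge_scores, temperature)
    predicted = softmax(similarities, temperature)
    return sum(t * math.log(t / p) for t, p in zip(target, predicted))


if __name__ == "__main__":
    # Hypothetical judge scores for five candidates; index 0 is the positive.
    scores = [0.95, 0.92, 0.60, 0.30, 0.55]
    print(select_hard_negatives(scores, positive_index=0))
    print(soft_alignment_loss([0.8, 0.7, 0.2, 0.1, 0.3], scores))
```

In this sketch, candidate 1 scores almost as high as the positive, so it is discarded as a probable false negative rather than pushed away during training. The loss is zero only when the similarity distribution matches the judge's distribution, which is one way to relax the rigid one-to-one mapping between a query and a single positive.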

Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

CV and Pattern Recognition

Teaches computers to understand pictures and words better.

24 Apr 2025 1

90%

From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model

Machine Learning (CS)

Helps computers understand pictures and words together.

1 Aug 2025 1

90%

LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning

CV and Pattern Recognition

Helps computers understand pictures and words together better.

4 Mar 2025 2

Country of Origin

🇬🇧 United Kingdom

Repos / Data Links

github.com

Page Count

12 pages
