
MLLM-Driven Semantic Identifier Generation for Generative Cross-Modal Retrieval

Published: September 22, 2025 | arXiv ID: 2509.17359v1

By: Tianyuan Li, Lei Wang, Ahtamjan Ahmat and more

Potential Business Impact:

Lets computers find images using smart descriptions.

Business Areas:

Semantic Search Internet Services

Generative cross-modal retrieval, which treats retrieval as a generation task, has emerged as a promising direction with the rise of Multimodal Large Language Models (MLLMs). In this setting, the model responds to a text query by generating an identifier corresponding to the target image. However, existing methods typically rely on manually crafted string IDs, clustering-based labels, or atomic identifiers requiring vocabulary expansion, all of which face challenges in semantic alignment or scalability. To address these limitations, we propose a vocabulary-efficient identifier generation framework that prompts MLLMs to generate Structured Semantic Identifiers from image-caption pairs. These identifiers are composed of concept-level tokens such as objects and actions, naturally aligning with the model's generation space without modifying the tokenizer. Additionally, we introduce a Rationale-Guided Supervision Strategy, which prompts the model to produce a one-sentence explanation alongside each identifier. This explanation serves as an auxiliary supervision signal that improves semantic grounding and reduces hallucinations during training.
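
To make the idea above concrete, here is a minimal sketch of how an identifier built from concept-level words, plus a one-sentence rationale, might be requested, parsed, and turned into a training target. Everything in it is an assumption for illustration: the prompt wording, the `Objects:` / `Actions:` / `Rationale:` response layout, the `|` separator, and the names `SemanticIdentifier`, `build_prompt`, `parse_response`, `training_target`, and `retrieve` do not come from the paper. The MLLM call itself is replaced by a hard-coded mock response.

```python
from dataclasses import dataclass


@dataclass
class SemanticIdentifier:
    """Hypothetical container for one image's identifier and rationale."""
    objects: list[str]
    actions: list[str]
    rationale: str

    def to_string(self) -> str:
        # Ordinary words only, so an unmodified tokenizer can already emit them.
        return " ".join(self.objects + self.actions)


def build_prompt(caption: str) -> str:
    """Assumed instruction that would be sent to an MLLM along with the image."""
    return (
        "Given the image and its caption, list the key concepts.\n"
        f"Caption: {caption}\n"
        "Answer in this form:\n"
        "Objects: <comma-separated nouns>\n"
        "Actions: <comma-separated verbs>\n"
        "Rationale: <one sentence>"
    )


def _split_terms(value: str) -> list[str]:
    return [term.strip().lower() for term in value.split(",") if term.strip()]


def parse_response(text: str) -> SemanticIdentifier:
    """Parse the assumed 'Key: value' response layout into an identifier."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return SemanticIdentifier(
        objects=_split_terms(fields.get("objects", "")),
        actions=_split_terms(fields.get("actions", "")),
        rationale=fields.get("rationale", ""),
    )


def training_target(sid: SemanticIdentifier) -> str:
    # Identifier first, rationale second: the rationale acts only as an
    # auxiliary training signal and is not needed to look up the image.
    return f"{sid.to_string()} | {sid.rationale}"


def retrieve(generated_identifier: str, index: dict[str, str]) -> "str | None":
    """Map a generated identifier string back to an image ID, if known."""
    return index.get(generated_identifier.strip().lower())


# Mock MLLM output standing in for a real model call.
MOCK_RESPONSE = (
    "Objects: dog, frisbee\n"
    "Actions: jumping, catching\n"
    "Rationale: A dog leaps to catch a frisbee in a park."
)

sid = parse_response(MOCK_RESPONSE)
index = {sid.to_string(): "img_001"}  # hypothetical image ID
```

In this sketch the identifier for the mock image is the string `dog frisbee jumping catching`, and `retrieve` is a plain dictionary lookup. A real system would also have to handle two images that yield the same identifier, and model output that matches no indexed identifier, neither of which is covered here.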

Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs

CV and Pattern Recognition

Finds fake pictures and shows why.

8 Jun 2025 0

89%

Leveraging LLMs for Multimodal Retrieval-Augmented Radiology Report Generation via Key Phrase Extraction

CV and Pattern Recognition

Helps doctors write X-ray reports faster.

10 Apr 2025 1

89%

Leveraging Retrieval Augmented Generative LLMs For Automated Metadata Description Generation to Enhance Data Catalogs

Information Retrieval

Helps find data faster by writing better descriptions.

12 Mar 2025 0


Page Count

10 pages
