NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-Identification

Published: May 26, 2025 | arXiv ID: 2505.20001v4

By: Shihao Li, Aihua Zheng, Andong Lu and more

Potential Business Impact:

Helps cameras find the same object in different pictures.

Business Areas:

Image Recognition, Data and Analytics, Software

Multi-modal object Re-Identification (ReID) aims to obtain accurate identity features across heterogeneous modalities. However, most existing methods rely on implicit feature fusion modules, making it difficult to model fine-grained recognition patterns under the various challenges of the real world. Powerful Multi-modal Large Language Models (MLLMs) can effectively translate object appearances into descriptive captions. In this paper, we propose a reliable caption generation pipeline based on attribute confidence, which significantly reduces the unknown recognition rate of MLLMs and improves the quality of the generated text. Additionally, to model diverse identity patterns, we propose a novel ReID framework named NEXT, the Multi-grained Mixture of Experts via Text-Modulation for Multi-modal Object Re-Identification. Specifically, we decouple the recognition problem into semantic and structural branches that separately capture fine-grained appearance features and coarse-grained structure features. For semantic recognition, we first propose Text-Modulated Semantic Experts (TMSE), which randomly sample high-quality captions to modulate experts that capture semantic features and mine inter-modality complementary cues. Second, to recognize structure features, we propose Context-Shared Structure Experts (CSSE), which focus on the holistic object structure and maintain identity structural consistency via a soft routing mechanism. Finally, we propose Multi-Grained Features Aggregation (MGFA), which adopts a unified fusion strategy to integrate the multi-grained experts into the final identity representations. Extensive experiments on four public datasets demonstrate the effectiveness of our method and show that it significantly outperforms existing state-of-the-art methods.
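
The abstract mentions a soft routing mechanism but does not spell it out. As a rough illustration of the general idea only, and not the authors' implementation, the sketch below mixes the outputs of several experts using softmax gate weights, so every expert contributes to the result instead of a single one being selected. All names (`soft_moe`, `gate_weights`, `expert_weights`), the linear experts, and the toy dimensions are assumptions made for this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def soft_moe(x, gate_weights, expert_weights):
    """Soft routing: weight every expert's output by its gate probability.

    Assumption: each expert is a plain linear map, chosen for brevity.
    Returns the mixed feature vector and the gate probabilities.
    """
    gates = softmax(matvec(gate_weights, x))          # one weight per expert
    outputs = [matvec(w, x) for w in expert_weights]  # one vector per expert
    dim = len(outputs[0])
    mixed = [sum(g * out[d] for g, out in zip(gates, outputs)) for d in range(dim)]
    return mixed, gates


def random_matrix(rows, cols, rng):
    """Hypothetical helper: a small random weight matrix for the demo."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
feature_dim, out_dim, num_experts = 4, 3, 2
x = [0.5, -1.0, 0.25, 2.0]  # toy input feature, not real ReID data
gate_weights = random_matrix(num_experts, feature_dim, rng)
expert_weights = [random_matrix(out_dim, feature_dim, rng) for _ in range(num_experts)]

mixed, gates = soft_moe(x, gate_weights, expert_weights)
```

Because the gate weights form a probability distribution, the mixed feature is a convex combination of the expert outputs. A hard (top-1) router would instead keep only the highest-scoring expert.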

IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification

CV and Pattern Recognition

Helps computers find objects using pictures and words.

13 Mar 2025 0

90%

Multi-modal Reference Learning for Fine-grained Text-to-Image Retrieval

CV and Pattern Recognition

Finds exact pictures from text descriptions.

10 Apr 2025 1

89%

Reliable Multi-Modal Object Re-Identification via Modality-Aware Graph Reasoning

CV and Pattern Recognition

Helps computers find lost objects in different pictures.

21 Apr 2025 0

Country of Origin

🇨🇳 China

Page Count

21 pages
