DeepMEL: A Multi-Agent Collaboration Framework for Multimodal Entity Linking
By: Fang Wang, Tianwei Yan, Zonghao Yang, and more
Potential Business Impact:
Helps computers understand pictures and words together.
Multimodal Entity Linking (MEL) aims to associate textual and visual mentions with entities in a multimodal knowledge graph. Despite its importance, current methods face challenges such as incomplete contextual information, coarse cross-modal fusion, and the difficulty of jointly leveraging large language models (LLMs) and large visual models (LVMs). To address these issues, we propose DeepMEL, a novel framework based on multi-agent collaborative reasoning that achieves efficient alignment and disambiguation of textual and visual modalities through a role-specialized division strategy. DeepMEL integrates four specialized agents, namely Modal-Fuser, Candidate-Adapter, Entity-Clozer, and Role-Orchestrator, to complete end-to-end cross-modal linking through specialized roles and dynamic coordination. DeepMEL adopts a dual-modal alignment path that combines the fine-grained text semantics generated by the LLM with the structured image representation extracted by the LVM, significantly narrowing the modal gap. We design an adaptive iteration strategy that combines tool-based retrieval and semantic reasoning to dynamically optimize the candidate set and balance recall and precision. DeepMEL also unifies MEL tasks into a structured cloze prompt to reduce parsing complexity and enhance semantic comprehension. Extensive experiments on five public benchmark datasets demonstrate that DeepMEL achieves state-of-the-art performance, improving accuracy (ACC) by 1%-57%. Ablation studies verify the effectiveness of all modules.
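To make the four-role pipeline described in the abstract concrete, the sketch below shows one possible way such a system could be wired together: a Modal-Fuser merges text and image descriptions, a Candidate-Adapter retrieves and filters candidate entities, an Entity-Clozer casts linking as a structured cloze question, and a Role-Orchestrator coordinates the three. All class names, prompt formats, and the stubbed retrieval and fusion logic are hypothetical illustrations, not the authors' implementation.

```python
# Minimal, illustrative sketch of a DeepMEL-style multi-agent pipeline.
# Every name and heuristic here is a placeholder assumption; a real system
# would call an LLM for fusion/reasoning and an LVM for image description.
from dataclasses import dataclass
from typing import List


@dataclass
class Mention:
    text: str            # textual mention with its sentence context
    image_caption: str   # structured image description (stand-in for LVM output)


@dataclass
class Candidate:
    name: str
    description: str


class ModalFuser:
    """Fuses fine-grained text semantics with the structured image representation."""
    def fuse(self, m: Mention) -> str:
        # Placeholder: simple concatenation instead of LLM/LVM-based fusion.
        return f"{m.text} [IMAGE] {m.image_caption}"


class CandidateAdapter:
    """Iteratively refines the candidate set (tool-based retrieval + filtering)."""
    def retrieve(self, fused: str, kb: List[Candidate], k: int = 3) -> List[Candidate]:
        # Placeholder retrieval: rank candidates by naive token overlap.
        tokens = set(fused.lower().split())
        scored = sorted(kb, key=lambda c: -len(tokens & set(c.description.lower().split())))
        return scored[:k]


class EntityClozer:
    """Unifies linking into a structured cloze prompt over the candidates."""
    def build_prompt(self, fused: str, candidates: List[Candidate]) -> str:
        options = "\n".join(f"({i}) {c.name}: {c.description}"
                            for i, c in enumerate(candidates))
        return (f"Context: {fused}\n"
                f"The mention refers to entity [MASK].\n"
                f"Options:\n{options}\n"
                f"Fill in [MASK] with the option number.")


class RoleOrchestrator:
    """Dynamically coordinates the other agents end to end."""
    def __init__(self):
        self.fuser = ModalFuser()
        self.adapter = CandidateAdapter()
        self.clozer = EntityClozer()

    def link(self, mention: Mention, kb: List[Candidate]) -> str:
        fused = self.fuser.fuse(mention)
        candidates = self.adapter.retrieve(fused, kb)
        prompt = self.clozer.build_prompt(fused, candidates)
        # A real pipeline would send `prompt` to an LLM; here we just return it.
        return prompt


if __name__ == "__main__":
    kb = [
        Candidate("Apple Inc.", "technology company that makes the iPhone"),
        Candidate("Apple (fruit)", "edible fruit of the apple tree"),
    ]
    m = Mention(text="Apple unveiled its new phone today.",
                image_caption="a smartphone on a stage with a company logo")
    print(RoleOrchestrator().link(m, kb))
```

The point of the sketch is the division of labor: each role owns one step (fusion, candidate adaptation, cloze construction, orchestration), which mirrors the role-specialized strategy the paper describes, while the actual reasoning inside each role would be delegated to LLM/LVM calls rather than the toy heuristics used here.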
Similar Papers
PGMEL: Policy Gradient-based Generative Adversarial Network for Multimodal Entity Linking
Computation and Language
Helps computers understand pictures and words together.
Multi-Modal Interpretability for Enhanced Localization in Vision-Language Models
CV and Pattern Recognition
Shows how computers see and understand pictures.
Harnessing Deep LLM Participation for Robust Entity Linking
Computation and Language
Helps computers understand names in text better.