EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions
By: Dinh-Khoi Vo, Van-Loc Nguyen, Minh-Triet Tran and more
Potential Business Impact:
Finds specific pictures from long, tricky stories.
Event-based image retrieval from free-form captions presents a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit causality, temporal context, or contain long, complex narratives. To tackle these issues, we introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection, followed by caption-guided semantic matching and rank-aware selection. We leverage Qwen3 for article search, Qwen3-Reranker for contextual alignment, and Qwen2-VL for precise image scoring. To further enhance performance and robustness, we fuse outputs from multiple configurations using Reciprocal Rank Fusion (RRF). Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge, demonstrating the effectiveness of combining language-based reasoning and multimodal retrieval for complex, real-world image understanding. The code is available at https://github.com/vdkhoi20/EVENT-Retriever.
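The abstract's final fusion step, Reciprocal Rank Fusion (RRF), is a standard rank-aggregation technique: each item's fused score is the sum of 1/(k + rank) over all ranked lists it appears in. A minimal sketch (the function name, image IDs, and the common default k=60 are illustrative, not taken from the paper's code):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists with RRF.

    Each item's score is the sum over lists of 1 / (k + rank),
    where rank is 1-based. k=60 is a common default that damps
    the influence of any single list's top positions.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical image rankings from two retrieval configurations
run_a = ["img3", "img1", "img2"]
run_b = ["img3", "img1", "img4"]
fused = reciprocal_rank_fusion([run_a, run_b])
# img3 ranks first in both lists, so it leads the fused ranking
```

Because RRF depends only on ranks, not raw scores, it lets the system combine configurations whose similarity scores live on incompatible scales.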
Similar Papers
ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization
CV and Pattern Recognition
Makes picture descriptions tell a whole story.
Beyond Vision: Contextually Enriched Image Captioning with Multi-Modal Retrieval
CV and Pattern Recognition
Adds missing story details to picture descriptions.
Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval
CV and Pattern Recognition
Finds pictures from words, even tricky ones.