ZSE-Cap: A Zero-Shot Ensemble for Image Retrieval and Prompt-Guided Captioning
By: Duc-Tai Dinh, Duc Anh Khoa Dinh
Potential Business Impact:
Helps computers describe pictures using words from accompanying articles.
We present ZSE-Cap (Zero-Shot Ensemble for Captioning), our 4th-place system in the Event-Enriched Image Analysis (EVENTA) shared task on article-grounded image retrieval and captioning. Our zero-shot approach requires no fine-tuning on the competition's data. For retrieval, we ensemble similarity scores from CLIP, SigLIP, and DINOv2. For captioning, we use a carefully engineered prompt to guide the Gemma 3 model, enabling it to link high-level events from the article to the visual content of the image. Our system achieved a final score of 0.42002 on the private test set, securing a top-4 position and demonstrating the effectiveness of combining foundation models through ensembling and prompting. Our code is available at https://github.com/ductai05/ZSE-Cap.
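The retrieval step, which ensembles similarity scores from several models, can be illustrated with a minimal sketch. The abstract does not state how the scores are fused, so the rule below (per-model z-score normalization followed by a weighted average) is an assumption chosen for illustration, not the authors' method. The function names `zscore` and `ensemble_rank`, the equal weights, and the example scores are all hypothetical.

```python
from statistics import mean, pstdev


def zscore(scores):
    """Standardize one model's scores so models with different scales are comparable."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All candidates tie under this model, so it contributes nothing.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def ensemble_rank(scores_by_model, weights=None):
    """Fuse per-model similarity scores; return (candidate indices best-first, fused scores).

    ASSUMPTION: weighted average of z-scored similarities. The source only says
    that scores from CLIP, SigLIP, and DINOv2 are ensembled.
    """
    names = list(scores_by_model)
    if weights is None:
        weights = {name: 1.0 for name in names}  # hypothetical equal weighting
    n = len(scores_by_model[names[0]])
    total = sum(weights[name] for name in names)
    fused = [0.0] * n
    for name in names:
        for i, value in enumerate(zscore(scores_by_model[name])):
            fused[i] += weights[name] * value / total
    ranking = sorted(range(n), key=lambda i: fused[i], reverse=True)
    return ranking, fused


# Made-up similarity scores for three candidates (not real model outputs).
scores = {
    "clip": [0.31, 0.28, 0.22],
    "siglip": [0.10, 0.14, 0.05],
    "dinov2": [0.62, 0.55, 0.40],
}
ranking, fused = ensemble_rank(scores)
print(ranking)
```

Normalizing before averaging keeps a model with a wider raw score range from dominating the fused ranking; other fusion rules, such as rank averaging, would fit the same interface.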
Similar Papers
ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization
CV and Pattern Recognition
Makes picture descriptions tell a whole story.
Data-Efficient Generalization for Zero-shot Composed Image Retrieval
CV and Pattern Recognition
Finds pictures using text and other pictures.
SGCap: Decoding Semantic Group for Zero-shot Video Captioning
CV and Pattern Recognition
Lets computers describe any video without practice.