Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction
By: Rui Fonseca, Bruno Martins, Gil Rocha
Potential Business Impact:
Lets computers describe pictures after learning from text alone.
Image captioning has drawn considerable attention from the natural language processing and computer vision fields. Aiming to reduce the reliance on curated data, several studies have explored image captioning without any human-annotated image-text pairs for training, although existing methods are still outperformed by fully supervised approaches. This paper proposes TOMCap, an improved text-only training method that performs captioning without the need for aligned image-caption pairs. The method is based on prompting a pre-trained language model decoder with information derived from a CLIP representation, after that representation undergoes a process to reduce the modality gap. We specifically test the combined use of retrieved caption examples and latent vector representations to guide the generation process. Through extensive experiments, we show that TOMCap outperforms other training-free and text-only methods. We also analyze the impact of different choices regarding the configuration of the retrieval augmentation and modality gap reduction components.
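To make the pipeline described in the abstract concrete, below is a minimal inference-time sketch: embed an image with CLIP, shift the embedding toward the text side of the embedding space to reduce the modality gap, retrieve nearby captions from a text-only datastore, and prompt a language-model decoder with them. The mean-shift gap correction, the placeholder `gap` vector, the toy in-memory datastore, the prompt format, and the user-supplied `decoder_fn` are illustrative assumptions, not TOMCap's actual components.

```python
# Sketch of text-only captioning at inference time (assumed design,
# not the paper's exact method).
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Text-only datastore: captions paired with normalized CLIP text embeddings.
captions = [
    "a dog running on a sandy beach",
    "a plate of pasta with tomato sauce",
    "two people riding bicycles in a park",
]
with torch.no_grad():
    text_inputs = processor(text=captions, return_tensors="pt", padding=True)
    text_emb = clip.get_text_features(**text_inputs)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Modality-gap vector: mean text embedding minus mean image embedding,
# estimated offline on a reference set (zeros here as a placeholder).
gap = torch.zeros(clip.config.projection_dim)

def caption_image(pil_image, decoder_fn, k=2):
    """Retrieve captions near the gap-corrected image embedding and
    prompt a user-supplied language-model decoder with them."""
    with torch.no_grad():
        img_inputs = processor(images=pil_image, return_tensors="pt")
        img_emb = clip.get_image_features(**img_inputs)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    # Shift the image embedding toward the text region of CLIP space,
    # then renormalize (one simple modality-gap correction).
    img_emb = img_emb + gap
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    # Retrieve the k most similar captions from the text-only datastore.
    sims = (img_emb @ text_emb.T).squeeze(0)
    retrieved = [captions[i] for i in sims.topk(k).indices.tolist()]
    prompt = "Similar captions: " + "; ".join(retrieved) + "\nCaption:"
    return decoder_fn(prompt)  # decoder_fn: any pretrained LM wrapper
```

Because the decoder is conditioned only on text (retrieved captions plus a gap-corrected embedding living near the text manifold), training such a system never requires aligned image-caption pairs, which is the core idea the abstract describes.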
Similar Papers
Multilingual Training-Free Remote Sensing Image Captioning
CV and Pattern Recognition
Lets computers describe satellite pictures in any language.
Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
CV and Pattern Recognition
Makes AI understand pictures using only words.
Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions
CV and Pattern Recognition
Finds pictures using only words, not images.