Multimodal Representation Alignment for Cross-modal Information Retrieval
By: Fan Xu, Luis A. Leiva
Potential Business Impact:
Finds matching pictures for words, and words for pictures.
Different machine learning models can represent the same underlying concept in different ways. This variability is particularly valuable for in-the-wild multimodal retrieval, where the objective is to identify the corresponding representation in one modality given another modality as input. This challenge can be effectively framed as a feature alignment problem: for example, given a sentence encoded by a language model, the task is to retrieve the most semantically aligned image based on features produced by an image encoder, or vice versa. In this work, we first investigate the geometric relationships between visual and textual embeddings derived from both vision-language models and combined unimodal models. We then align these representations using four standard similarity metrics as well as two learned ones, implemented via neural networks. Our findings indicate that the Wasserstein distance can serve as an informative measure of the modality gap, while cosine similarity consistently outperforms alternative metrics in feature alignment tasks. Furthermore, we observe that conventional architectures such as multilayer perceptrons are insufficient for capturing the complex interactions between image and text representations. Our study offers novel insights and practical considerations for researchers working in multimodal information retrieval, particularly in real-world, cross-modal applications.
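To make the retrieval setup in the abstract concrete, here is a minimal sketch of cosine-similarity retrieval: a text embedding is scored against a set of image embeddings and the best matches are returned. It is an illustration only, not the authors' implementation. The function names (`cosine_similarity`, `retrieve`), the toy three-dimensional vectors, and the assumption that both modalities already live in a shared space of equal dimension are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, candidates, top_k=1):
    """Return indices of the top_k candidates most similar to the query."""
    scored = [(cosine_similarity(query, c), i) for i, c in enumerate(candidates)]
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:top_k]]


# Toy data (assumed): one text embedding and three image embeddings.
text_embedding = [0.9, 0.1, 0.0]
image_embeddings = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.0],
    [0.0, 0.0, 1.0],
]

best = retrieve(text_embedding, image_embeddings, top_k=2)
print(best)
```

Swapping the roles of the two modalities (an image embedding as the query, text embeddings as candidates) gives the image-to-text direction with the same function.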
Similar Papers
On the Value of Cross-Modal Misalignment in Multimodal Representation Learning
Machine Learning (CS)
Helps computers understand pictures and words better.
To Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance
Machine Learning (CS)
Makes AI learn better by balancing information.