Answering Multimodal Exclusion Queries with Lightweight Sparse Disentangled Representations
By: Prachi J, Sumit Bhatia, Srikanta Bedathur
Potential Business Impact:
Helps computers find images from text descriptions, even when the query says what should not appear.
Multimodal representations that enable cross-modal retrieval are widely used. However, they often lack interpretability, making it difficult to explain the retrieved results. Solutions such as learning sparse disentangled representations are typically guided by the text tokens in the data, making the dimensionality of the resulting embeddings very high. We propose an approach that generates lower-dimensional, fixed-size embeddings that are not only disentangled but also offer better control for retrieval tasks. We demonstrate their utility using challenging exclusion queries over the MSCOCO and Conceptual Captions benchmarks. Our experiments show that our approach is superior to traditional dense models such as CLIP, BLIP, and VISTA (gains of up to 11% in AP@10), as well as to sparse disentangled models such as VDR (gains of up to 21% in AP@10). We also present qualitative results to further underline the interpretability of disentangled representations.
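The abstract does not spell out how exclusion queries are scored, so the sketch below is only a minimal illustration of the general idea, assuming sparse, non-negative embeddings in which each dimension corresponds to an interpretable concept. The function names, the subtraction-based exclusion score, and the toy data are all hypothetical and are not the paper's actual formulation; they simply show why dimension-aligned disentanglement gives direct control over what a retrieval query includes or excludes, and how AP@10 is computed.

```python
import numpy as np

def score_exclusion_query(image_embs, include_vec, exclude_vec, penalty=1.0):
    """Score images for a query like "dog but not leash".

    image_embs : (N, D) sparse, non-negative disentangled image embeddings,
                 where each dimension is assumed to map to one concept.
    include_vec: (D,) embedding of the wanted part of the query.
    exclude_vec: (D,) embedding of the excluded concept.
    The excluded concept lowers the score in proportion to how strongly
    the image activates its dimensions (an illustrative assumption only).
    """
    pos = image_embs @ include_vec   # similarity to the wanted concept
    neg = image_embs @ exclude_vec   # activation of the excluded concept
    return pos - penalty * neg

def average_precision_at_k(ranked_relevance, k=10):
    """AP@k over a ranked list of 0/1 relevance labels."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_relevance[:k], start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0

# Toy example: 4 images, 5 interpretable dimensions
# (dog, leash, grass, person, car)
image_embs = np.array([
    [0.9, 0.0, 0.4, 0.0, 0.0],   # dog on grass, no leash
    [0.8, 0.7, 0.0, 0.3, 0.0],   # dog on a leash with a person
    [0.0, 0.0, 0.2, 0.9, 0.1],   # person only
    [0.7, 0.0, 0.0, 0.0, 0.0],   # dog only
])
include_vec = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # "dog"
exclude_vec = np.array([0.0, 1.0, 0.0, 0.0, 0.0])   # "... but not a leash"

scores = score_exclusion_query(image_embs, include_vec, exclude_vec)
order = np.argsort(-scores)                  # best match first
relevant = np.array([1, 0, 0, 1])            # images that truly satisfy the query
print(order, average_precision_at_k(relevant[order].tolist()))
```

The subtraction only behaves predictably because the embeddings are sparse and each dimension is tied to a single concept; with entangled dense embeddings (e.g., raw CLIP vectors) there is no clean dimension to penalize, which is the control the paper's disentangled representations are meant to provide.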
Similar Papers
Sparse and Dense Retrievers Learn Better Together: Joint Sparse-Dense Optimization for Text-Image Retrieval
Computation and Language
Helps computers find images from text more accurately.
Language-Guided Visual Perception Disentanglement for Image Quality Assessment and Conditional Image Generation
CV and Pattern Recognition
Helps computers see images better, not just understand them.
Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction
CV and Pattern Recognition
Find 3D objects using text descriptions.