Answering Multimodal Exclusion Queries with Lightweight Sparse Disentangled Representations

Published: April 4, 2025 | arXiv ID: 2504.03184v3

By: Prachi J, Sumit Bhatia, Srikanta Bedathur

Potential Business Impact:

Lets image search handle queries that exclude things (e.g., "a beach without people") and makes it easier to explain why each result was returned.

Business Areas:
Semantic Search, Internet Services

Multimodal representations that enable cross-modal retrieval are widely used. However, they often lack interpretability, making it difficult to explain why a result was retrieved. Existing solutions, such as learning sparse disentangled representations, are typically guided by the text tokens in the data, which makes the dimensionality of the resulting embeddings very high. We propose an approach that generates fixed-size embeddings of much lower dimensionality that are not only disentangled but also offer better control for retrieval tasks. We demonstrate their utility on challenging exclusion queries over the MSCOCO and Conceptual Captions benchmarks. Our experiments show that our approach outperforms traditional dense models such as CLIP, BLIP, and VISTA (gains of up to 11% in AP@10), as well as sparse disentangled models such as VDR (gains of up to 21% in AP@10). We also present qualitative results that further underline the interpretability of disentangled representations.
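The abstract does not spell out the scoring mechanism, so the sketch below is only a rough illustration of how an exclusion query ("X but not Y") could be served over sparse disentangled embeddings: reward overlap with the included concept and penalize activation on the excluded one. The function names, the penalty weight `alpha`, and the AP@10 helper are hypothetical and not taken from the paper.

```python
import numpy as np

def exclusion_score(doc_vecs, include_vec, exclude_vec, alpha=1.0):
    """Score documents for a query like "X but not Y".

    doc_vecs:    (n_docs, d) sparse disentangled document embeddings
    include_vec: (d,) embedding of the desired concept X
    exclude_vec: (d,) embedding of the excluded concept Y
    alpha:       hypothetical penalty weight for the excluded concept
    """
    return doc_vecs @ include_vec - alpha * (doc_vecs @ exclude_vec)

def average_precision_at_k(ranked_relevance, k=10):
    """AP@10 (the metric reported in the abstract): mean of
    precision@i over ranks i <= k where a relevant item appears."""
    hits, precisions = 0, []
    for i, r in enumerate(ranked_relevance[:k], start=1):
        if r:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0

# Toy usage: 4 documents in a 5-dimensional disentangled space.
docs = np.array([
    [0.9, 0.0, 0.1, 0.0, 0.0],   # strongly "X", no "Y"
    [0.8, 0.7, 0.0, 0.0, 0.0],   # both "X" and "Y"
    [0.1, 0.0, 0.0, 0.9, 0.0],   # neither
    [0.7, 0.1, 0.0, 0.0, 0.2],   # mostly "X"
])
x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # concept to include
y = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # concept to exclude

order = np.argsort(-exclusion_score(docs, x, y))
print(order)  # documents with "X but not Y" rank first: [0 3 1 2]

# AP@10 for toy relevance judgments (docs 0 and 3 are relevant).
rel = [1 if i in (0, 3) else 0 for i in order]
print(average_precision_at_k(rel, k=10))  # 1.0
```

Because each dimension of a disentangled embedding is meant to correspond to an identifiable concept, a linear penalty like this also yields an interpretable explanation: one can point to the excluded dimensions that suppressed a document's score.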

Country of Origin
🇮🇳 India

Page Count
8 pages

Category
Computer Science:
Information Retrieval