Answering Multimodal Exclusion Queries with Lightweight Sparse Disentangled Representations

Published: April 4, 2025 | arXiv ID: 2504.03184v3

By: Prachi J, Sumit Bhatia, Srikanta Bedathur

Potential Business Impact:

Lets image search handle queries that exclude things (e.g., "a beach without people") and makes it easier to explain why each result was returned.

Business Areas:
Semantic Search, Internet Services

Multimodal representations that enable cross-modal retrieval are widely used. However, they often lack interpretability, making it difficult to explain why a result was retrieved. Existing solutions, such as learning sparse disentangled representations, are typically guided by the text tokens in the data, which makes the dimensionality of the resulting embeddings very high. We propose an approach that generates fixed-size embeddings of much lower dimensionality that are not only disentangled but also offer better control for retrieval tasks. We demonstrate their utility on challenging exclusion queries over the MSCOCO and Conceptual Captions benchmarks. Our experiments show that our approach outperforms traditional dense models such as CLIP, BLIP, and VISTA (gains of up to 11% in AP@10), as well as sparse disentangled models such as VDR (gains of up to 21% in AP@10). We also present qualitative results that further underline the interpretability of disentangled representations.
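The abstract does not spell out the scoring mechanism, so the sketch below is only a rough illustration of how an exclusion query ("X but not Y") could be served over sparse disentangled embeddings: reward overlap with the included concept and penalize activation on the excluded one. The function names, the penalty weight `alpha`, and the AP@10 helper are hypothetical and not taken from the paper.

```python
import numpy as np

def exclusion_score(doc_vecs, include_vec, exclude_vec, alpha=1.0):
    """Score documents for a query like "X but not Y".

    doc_vecs:    (n_docs, d) sparse disentangled document embeddings
    include_vec: (d,) embedding of the desired concept X
    exclude_vec: (d,) embedding of the excluded concept Y
    alpha:       hypothetical penalty weight for the excluded concept
    """
    return doc_vecs @ include_vec - alpha * (doc_vecs @ exclude_vec)

def average_precision_at_k(ranked_relevance, k=10):
    """AP@10 (the metric reported in the abstract): mean of
    precision@i over ranks i <= k where a relevant item appears."""
    hits, precisions = 0, []
    for i, r in enumerate(ranked_relevance[:k], start=1):
        if r:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0

# Toy usage: 4 documents in a 5-dimensional disentangled space.
docs = np.array([
    [0.9, 0.0, 0.1, 0.0, 0.0],   # strongly "X", no "Y"
    [0.8, 0.7, 0.0, 0.0, 0.0],   # both "X" and "Y"
    [0.1, 0.0, 0.0, 0.9, 0.0],   # neither
    [0.7, 0.1, 0.0, 0.0, 0.2],   # mostly "X"
])
x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # concept to include
y = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # concept to exclude

order = np.argsort(-exclusion_score(docs, x, y))
print(order)  # documents with "X but not Y" rank first: [0 3 1 2]

# AP@10 for toy relevance judgments (docs 0 and 3 are relevant).
rel = [1 if i in (0, 3) else 0 for i in order]
print(average_precision_at_k(rel, k=10))  # 1.0
```

Because each dimension of a disentangled embedding is meant to correspond to an identifiable concept, a linear penalty like this also yields an interpretable explanation: one can point to the excluded dimensions that suppressed a document's score.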

Country of Origin
🇮🇳 India

Page Count
8 pages

Category
Computer Science:
Information Retrieval