Embedding the Teacher: Distilling vLLM Preferences for Scalable Image Retrieval
By: Eric He, Akash Gupta, Adian Liusie, and more
Potential Business Impact:
Finds personalized gifts using smart computer vision.
Text-image retrieval is necessary for applications such as product recommendation. Embedding-based approaches like CLIP enable efficient large-scale retrieval via vector similarity search, but they are primarily trained on literal caption-like text-image pairs and often fail to capture abstract or persona-driven attributes common in product recommendation applications (e.g., "a gift for a mother who loves gardening"). In contrast, state-of-the-art vision-language models (vLLMs) can align text with images in a flexible manner, but their limited context window prevents them from directly handling retrieval over large catalogs. We propose a framework that distills the preference rankings of a powerful vLLM into an embedding-based system, transferring its nuanced alignment abilities while maintaining the inference-time scalability of an embedding-based approach. Experiments on persona-driven product recommendation tasks demonstrate that our method significantly outperforms existing embedding-based baselines, providing an efficient solution for personalized text-image retrieval.
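To make the distillation idea concrete, the sketch below shows one plausible way to train a student embedding model against a teacher vLLM's preference ranking: a ListNet-style soft cross-entropy that pushes the student's cosine-similarity distribution over candidate images toward the teacher's preference distribution. This is not the paper's implementation; the encoder stand-ins, temperature values, and score conversion are assumptions for illustration.

```python
# Minimal sketch (not the authors' code) of ranking distillation into an
# embedding model. Random tensors stand in for CLIP-style text/image
# embeddings; in practice they would come from the student encoders.

import torch
import torch.nn.functional as F


def distillation_loss(query_emb, cand_embs, teacher_scores,
                      tau_student=0.05, tau_teacher=1.0):
    """ListNet-style loss matching the student's similarity distribution
    over k candidate images to the teacher vLLM's preference distribution.

    query_emb:      (d,)   embedding of the persona/query text
    cand_embs:      (k, d) embeddings of k candidate product images
    teacher_scores: (k,)   scores derived from the teacher's ranking
    (temperatures tau_student / tau_teacher are illustrative choices)
    """
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(cand_embs, dim=-1)
    student_logits = (c @ q) / tau_student                 # cosine similarities
    teacher_probs = F.softmax(teacher_scores / tau_teacher, dim=-1)
    # Soft-target cross-entropy (supported by torch.nn.functional.cross_entropy)
    return F.cross_entropy(student_logits.unsqueeze(0),
                           teacher_probs.unsqueeze(0))


# Toy usage: one query against 8 candidates, teacher prefers items 2 and 5.
torch.manual_seed(0)
d, k = 512, 8
query_emb = torch.randn(d, requires_grad=True)
cand_embs = torch.randn(k, d, requires_grad=True)
teacher_scores = torch.tensor([0.1, 0.2, 3.0, 0.1, 0.3, 2.0, 0.1, 0.5])

loss = distillation_loss(query_emb, cand_embs, teacher_scores)
loss.backward()  # gradients flow into the student embeddings/encoder
print(float(loss))
```

At inference time the distilled student behaves like any other embedding retriever: catalog images are encoded once, and a persona query is matched by nearest-neighbor search over the stored vectors, preserving the scalability the abstract highlights.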
Similar Papers
Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models
Information Retrieval
Finds better products you'll like to buy.
Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
Information Retrieval
Helps video apps understand what you *really* like.
Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions
CV and Pattern Recognition
Finds pictures using only words, not images.