Embedding the Teacher: Distilling vLLM Preferences for Scalable Image Retrieval
By: Eric He, Akash Gupta, Adian Liusie, and more
Potential Business Impact:
Finds personalized gifts using smart computer vision.
Text-image retrieval is necessary for applications such as product recommendation. Embedding-based approaches like CLIP enable efficient large-scale retrieval via vector similarity search, but they are primarily trained on literal caption-like text-image pairs and often fail to capture abstract or persona-driven attributes common in product recommendation applications (e.g., "a gift for a mother who loves gardening"). In contrast, state-of-the-art vision-language models (vLLMs) can align text with images in a flexible manner, but their limited context window prevents them from directly handling retrieval over large catalogs. We propose a framework that distills the preference rankings of a powerful vLLM into an embedding-based system, transferring its nuanced alignment abilities while maintaining the inference-time scalability of an embedding-based approach. Experiments on persona-driven product recommendation tasks demonstrate that our method significantly outperforms existing embedding-based baselines, providing an efficient solution for personalized text-image retrieval.
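To make the distillation idea concrete, the sketch below shows one plausible way to train a student embedding model against a teacher vLLM's preference ranking: a ListNet-style soft cross-entropy that pushes the student's cosine-similarity distribution over candidate images toward the teacher's preference distribution. This is not the paper's implementation; the encoder stand-ins, temperature values, and score conversion are assumptions for illustration.

```python
# Minimal sketch (not the authors' code) of ranking distillation into an
# embedding model. Random tensors stand in for CLIP-style text/image
# embeddings; in practice they would come from the student encoders.

import torch
import torch.nn.functional as F


def distillation_loss(query_emb, cand_embs, teacher_scores,
                      tau_student=0.05, tau_teacher=1.0):
    """ListNet-style loss matching the student's similarity distribution
    over k candidate images to the teacher vLLM's preference distribution.

    query_emb:      (d,)   embedding of the persona/query text
    cand_embs:      (k, d) embeddings of k candidate product images
    teacher_scores: (k,)   scores derived from the teacher's ranking
    (temperatures tau_student / tau_teacher are illustrative choices)
    """
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(cand_embs, dim=-1)
    student_logits = (c @ q) / tau_student                 # cosine similarities
    teacher_probs = F.softmax(teacher_scores / tau_teacher, dim=-1)
    # Soft-target cross-entropy (supported by torch.nn.functional.cross_entropy)
    return F.cross_entropy(student_logits.unsqueeze(0),
                           teacher_probs.unsqueeze(0))


# Toy usage: one query against 8 candidates, teacher prefers items 2 and 5.
torch.manual_seed(0)
d, k = 512, 8
query_emb = torch.randn(d, requires_grad=True)
cand_embs = torch.randn(k, d, requires_grad=True)
teacher_scores = torch.tensor([0.1, 0.2, 3.0, 0.1, 0.3, 2.0, 0.1, 0.5])

loss = distillation_loss(query_emb, cand_embs, teacher_scores)
loss.backward()  # gradients flow into the student embeddings/encoder
print(float(loss))
```

At inference time the distilled student behaves like any other embedding retriever: catalog images are encoded once, and a persona query is matched by nearest-neighbor search over the stored vectors, preserving the scalability the abstract highlights.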
Similar Papers
Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models
Information Retrieval
Finds better products you'll like to buy.
Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
Information Retrieval
Helps video apps understand what you *really* like.
Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions
CV and Pattern Recognition
Finds pictures using only words, not images.