Score: 0

Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation

Published: August 6, 2025 | arXiv ID: 2508.04571v1

By: Claudio Pomo , Matteo Attimonelli , Danilo Danese and more

Potential Business Impact:

Makes movie suggestions better using pictures and words.

Plain English Summary

Imagine getting movie or product recommendations that are actually good, not just random guesses. This new method uses smart AI that understands both pictures and words together, like how you'd describe something. This means you'll get suggestions that truly match what you're looking for, making online shopping and entertainment much more enjoyable.

Multimodal Recommender Systems aim to improve recommendation accuracy by integrating heterogeneous content, such as images and textual metadata. While effective, it remains unclear whether their gains stem from true multimodal understanding or increased model complexity. This work investigates the role of multimodal item embeddings, emphasizing the semantic informativeness of the representations. Initial experiments reveal that embeddings from standard extractors (e.g., ResNet50, Sentence-Bert) enhance performance, but rely on modality-specific encoders and ad hoc fusion strategies that lack control over cross-modal alignment. To overcome these limitations, we leverage Large Vision-Language Models (LVLMs) to generate multimodal-by-design embeddings via structured prompts. This approach yields semantically aligned representations without requiring any fusion. Experiments across multiple settings show notable performance improvements. Furthermore, LVLMs embeddings offer a distinctive advantage: they can be decoded into structured textual descriptions, enabling direct assessment of their multimodal comprehension. When such descriptions are incorporated as side content into recommender systems, they improve recommendation performance, empirically validating the semantic depth and alignment encoded within LVLMs outputs. Our study highlights the importance of semantically rich representations and positions LVLMs as a compelling foundation for building robust and meaningful multimodal representations in recommendation tasks.

MLLMRec: Exploring the Potential of Multimodal Large Language Models in Recommender Systems

Information Retrieval

Suggests better movies and products you'll like.

21 Aug 2025 2

91%

Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations

Information Retrieval

Helps video apps understand what you *really* like.

13 Aug 2025 2

91%

Unified Multimodal and Multilingual Retrieval via Multi-Task Learning with NLU Integration

Information Retrieval

Finds images and text better, even in different languages.

21 Jan 2026 1

View PDF Login to Bookmark

Country of Origin

🇮🇹 Italy

Page Count

11 pages

Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation

Makes movie suggestions better using pictures and words.

Plain English Summary

Technical Abstract

MLLMRec: Exploring the Potential of Multimodal Large Language Models in Recommender Systems

Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations

Unified Multimodal and Multilingual Retrieval via Multi-Task Learning with NLU Integration