MMSRARec: Summarization and Retrieval Augumented Sequential Recommendation Based on Multimodal Large Language Model

Published: December 24, 2025 | arXiv ID: 2512.20916v1

By: Haoyu Wang, Yitong Wang, Jining Wang

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant potential in recommendation systems. However, the effective application of MLLMs to multimodal sequential recommendation remains unexplored: A) Existing methods primarily leverage the multimodal semantic understanding capabilities of pre-trained MLLMs to generate item embeddings or semantic IDs, thereby enhancing traditional recommendation models. These approaches generate item representations that exhibit limited interpretability and pose challenges when transferring to language model-based recommendation systems. B) Other approaches convert user behavior sequences into image-text pairs and perform recommendation through multiple MLLM inference passes, incurring prohibitive computational and time costs. C) Current MLLM-based recommendation systems generally neglect the integration of collaborative signals. To address these limitations while balancing recommendation performance, interpretability, and computational cost, this paper proposes MultiModal Summarization-and-Retrieval-Augmented Sequential Recommendation (MMSRARec). Specifically, we first employ an MLLM to summarize items into concise keywords and fine-tune the model using rewards that incorporate summary length, information loss, and reconstruction difficulty, thereby enabling adaptive adjustment of the summarization policy. Inspired by retrieval-augmented generation, we then transform collaborative signals into corresponding keywords and integrate them as supplementary context. Finally, we apply supervised fine-tuning with multi-task learning to align the MLLM with the multimodal sequential recommendation task. Extensive evaluations on common recommendation datasets demonstrate the effectiveness of MMSRARec, showcasing its capability to efficiently and interpretably understand user behavior histories and item information for accurate recommendations.

A Remarkably Efficient Paradigm to Multimodal Large Language Models for Sequential Recommendation

Information Retrieval

Makes online shopping suggestions faster and smarter.

8 Nov 2025 1

93%

MLLMRec: Exploring the Potential of Multimodal Large Language Models in Recommender Systems

Information Retrieval

Suggests better movies and products you'll like.

21 Aug 2025 2

92%

A Survey on Large Language Models in Multimodal Recommender Systems

Information Retrieval

Helps computers suggest movies and products better.

14 May 2025 1
