Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation
By: Yu Wang, Yonghui Yang, Le Wu, and more
Potential Business Impact:
Helps computers pick what you'll like better.
Recent advances in Large Language Models (LLMs) have opened new avenues for sequential recommendation by enabling natural language reasoning over user behavior sequences. A common approach formulates recommendation as a language modeling task, where interaction histories are transformed into prompts and user preferences are learned via supervised fine-tuning. However, these methods operate solely in the textual modality and often miss users' fine-grained interests, especially when shaped by rich visual signals such as product images or movie posters. Multimodal Large Language Models (MLLMs) offer a promising alternative by aligning text and vision in a shared semantic space. A prevalent training paradigm applies Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO) to model user preferences. Yet, two core challenges remain: 1) Imbalanced sample hardness, where random negative sampling causes overfitting on easy examples and under-training on hard ones; 2) Cross-modal semantic bias, where the fixed reference model in DPO prevents the policy model from correcting modality misalignments, especially over long sequences. To address these issues, we propose a Multimodal LLM framework that integrates Hardness-aware and Noise-regularized preference optimization for Recommendation (HaNoRec). Specifically, HaNoRec dynamically adjusts optimization weights based on both the estimated hardness of each training sample and the policy model's real-time responsiveness, prioritizing harder examples. It further introduces Gaussian-perturbed distribution optimization on output logits to enhance cross-modal semantic consistency and reduce modality bias inherited from the reference model.
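The abstract names two mechanisms, hardness-aware weighting of preference pairs and Gaussian-perturbed optimization on output logits, but gives no equations. Below is a minimal PyTorch-style sketch of how such an objective could be combined with a standard DPO margin; the function name, the sigmoid-based hardness weight, the noise scale, and the KL consistency term are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch (not the authors' code): a hardness-aware, noise-regularized
# DPO-style objective. Assumes per-sequence log-probabilities are already
# summed over tokens; `hardness_weight`, `noise_std`, and `consistency_coeff`
# are placeholder names, not terms from the paper.
import torch
import torch.nn.functional as F


def hardness_noise_dpo_loss(
    policy_chosen_logps,      # (B,) log pi_theta(y_w | x)
    policy_rejected_logps,    # (B,) log pi_theta(y_l | x)
    ref_chosen_logps,         # (B,) log pi_ref(y_w | x)
    ref_rejected_logps,       # (B,) log pi_ref(y_l | x)
    policy_logits,            # (B, T, V) policy output logits
    beta=0.1,
    noise_std=0.05,
    consistency_coeff=0.1,
):
    # Standard DPO margin: implicit reward gap between chosen and rejected.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    margin = beta * (chosen_logratio - rejected_logratio)

    # Hardness estimate: pairs the policy currently separates poorly (small or
    # negative margin) are treated as harder and up-weighted; easy pairs are
    # down-weighted. Detached so the weight itself receives no gradient.
    hardness_weight = torch.sigmoid(-margin).detach()
    hardness_weight = hardness_weight / (hardness_weight.mean() + 1e-8)

    dpo_loss = -(hardness_weight * F.logsigmoid(margin)).mean()

    # Gaussian-perturbed distribution consistency: add noise to the logits and
    # penalize divergence between clean and perturbed token distributions, a
    # generic stand-in for the paper's noise-regularized optimization.
    noisy_logits = policy_logits + noise_std * torch.randn_like(policy_logits)
    consistency = F.kl_div(
        F.log_softmax(noisy_logits, dim=-1),
        F.softmax(policy_logits, dim=-1).detach(),
        reduction="batchmean",
    )

    return dpo_loss + consistency_coeff * consistency
```

In this sketch, well-separated pairs contribute little gradient while hard pairs dominate, and the consistency term asks the output distribution to stay stable under small logit perturbations; how HaNoRec actually estimates hardness and injects noise may differ from these stand-ins.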
Similar Papers
Aligning Large Vision-Language Models by Deep Reinforcement Learning and Direct Preference Optimization
Machine Learning (CS)
Teaches AI to understand pictures and words better.
M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following
Computation and Language
Teaches AI to follow picture instructions better.
MLLMRec: Exploring the Potential of Multimodal Large Language Models in Recommender Systems
Information Retrieval
Suggests better movies and products you'll like.