M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following
By: Ruirui Gao, Emily Johnson, Bowen Tan, and more
Potential Business Impact:
Teaches AI to follow picture instructions better.
Large Vision-Language Models (LVLMs) hold immense potential for complex multimodal instruction following, yet their development is often hindered by the high cost and inconsistency of the human annotation required for effective fine-tuning and preference alignment. Traditional supervised fine-tuning (SFT) and existing preference optimization methods such as RLHF and DPO frequently fail to exploit the model's own generation space efficiently to identify highly informative "hard negative" samples. To address these challenges, we propose Multimodal-Model-Guided Preference Optimization (M3PO), a novel and data-efficient method designed to enhance LVLMs' capabilities in visual instruction following. M3PO selects the most "learning-valuable" preference sample pairs from a diverse pool of LVLM-generated candidates. Selection is driven by a mechanism that integrates two signals: a Multimodal Alignment Score (MAS), which assesses external quality, and the model's self-consistency / confidence (log-probability), which gauges internal belief. These are combined into a novel M3P-Score, which identifies both high-quality preferred responses and challenging dispreferred responses, namely responses the model generates confidently despite being incorrect. The resulting high-quality preference pairs are then used for efficient Direct Preference Optimization (DPO) fine-tuning of base LVLMs such as LLaVA-1.5 (7B/13B) with LoRA. Our extensive experiments demonstrate that M3PO consistently outperforms strong baselines, including SFT, simulated RLHF, vanilla DPO, and RM-DPO, across a comprehensive suite of multimodal instruction following benchmarks (MME-Bench, POPE, IFT, Human Pref. Score).
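To make the pair-selection idea concrete, here is a minimal sketch of how an M3P-Score-style selection step could look. The abstract does not give the scoring formula, so everything below is an assumption: the score is modeled as a simple weighted combination of MAS and a confidence derived from the mean token log-probability, and the weight alpha, the field names, and the toy candidates are illustrative placeholders rather than the authors' implementation.

```python
# Hypothetical sketch of M3P-Score-based preference-pair selection (not the paper's code).
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    text: str
    mas: float       # Multimodal Alignment Score (external quality), assumed in [0, 1]
    logprob: float   # mean per-token log-probability under the LVLM (internal belief)

    @property
    def confidence(self) -> float:
        # Geometric-mean token probability in (0, 1], putting confidence on the same scale as MAS.
        return math.exp(self.logprob)


def preferred_score(c: Candidate, alpha: float = 0.5) -> float:
    """Assumed preferred-side score: high external quality and high confidence."""
    return alpha * c.mas + (1.0 - alpha) * c.confidence


def hard_negative_score(c: Candidate, alpha: float = 0.5) -> float:
    """Assumed dispreferred-side score: low external quality but high confidence,
    i.e. a response the model generates confidently despite being incorrect."""
    return alpha * (1.0 - c.mas) + (1.0 - alpha) * c.confidence


def select_preference_pair(candidates: List[Candidate]) -> Tuple[Candidate, Candidate]:
    """Pick a (preferred, dispreferred) pair for DPO from a pool of LVLM-generated candidates."""
    preferred = max(candidates, key=preferred_score)
    remaining = [c for c in candidates if c is not preferred]
    dispreferred = max(remaining, key=hard_negative_score)
    return preferred, dispreferred


if __name__ == "__main__":
    pool = [
        Candidate("The sign in the image says 'OPEN'.", mas=0.92, logprob=-0.4),
        Candidate("The sign in the image says 'CLOSED'.", mas=0.15, logprob=-0.3),  # confident but wrong
        Candidate("There is no sign in the image.", mas=0.10, logprob=-2.1),
    ]
    chosen, rejected = select_preference_pair(pool)
    print("preferred:", chosen.text)
    print("dispreferred (hard negative):", rejected.text)
```

Under these assumptions, the confident-but-wrong candidate wins the hard-negative slot over the low-confidence one, which is the behavior the abstract attributes to the M3P-Score; the selected pairs would then feed a standard DPO objective with LoRA adapters on the base LVLM.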
Similar Papers
Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation
Information Retrieval
Helps computers pick what you'll like better.
Aligning Large Vision-Language Models by Deep Reinforcement Learning and Direct Preference Optimization
Machine Learning (CS)
Teaches AI to understand pictures and words better.
Beyond Single-Reward: Multi-Pair, Multi-Perspective Preference Optimization for Machine Translation
Computation and Language
Teaches computers to translate languages better.