
MM-IFEngine: Towards Multimodal Instruction Following

Published: April 10, 2025 | arXiv ID: 2504.07957v2

By: Shengyuan Ding, Shenxi Wu, Xiangyu Zhao and more

Potential Business Impact:

Teaches AI to follow picture instructions precisely.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction-following training data is scarce, the benchmarks are simple and limited to atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data, MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and is extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both compose-level constraints for output responses and perception-level constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating both rule-based assessment and a judge model. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+10.2%), MIA (+7.6%), and IFEval (+12.3%). We have fully open-sourced the datasets (both SFT and DPO), evaluation code, and training scripts at https://github.com/SYuan03/MM-IFEngine.
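The idea of combining rule-based assessment with a judge model can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names (max_words, contains_keyword, evaluate), the constraint types, and the judge interface are all hypothetical, and the judge is a plain callable standing in for a model call. Constraints that can be checked deterministically go through rules; the rest are passed to the judge, and the score is the fraction of constraints satisfied.

```python
from typing import Callable, List

# Hypothetical rule-based checkers (illustrative only, not taken from MM-IFEval).
def max_words(limit: int) -> Callable[[str], bool]:
    """Return a check that passes when the response has at most `limit` words."""
    return lambda response: len(response.split()) <= limit

def contains_keyword(keyword: str) -> Callable[[str], bool]:
    """Return a check that passes when the response mentions `keyword` (case-insensitive)."""
    return lambda response: keyword.lower() in response.lower()

def evaluate(
    response: str,
    rule_checks: List[Callable[[str], bool]],
    judged_constraints: List[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Score a response as the fraction of constraints it satisfies.

    Rule checks are applied directly; each free-form constraint is delegated
    to `judge(response, constraint)`, an assumed stand-in for a judge model.
    """
    results = [check(response) for check in rule_checks]
    results += [judge(response, constraint) for constraint in judged_constraints]
    return sum(results) / len(results) if results else 0.0

if __name__ == "__main__":
    # Toy judge: a real setup would query a model instead of matching a string.
    toy_judge = lambda response, constraint: "red" in response.lower()
    answer = "A red bus is parked near the old station."
    score = evaluate(
        answer,
        [max_words(10), contains_keyword("bus")],
        ["The response mentions the colour of the vehicle."],
        toy_judge,
    )
    print(f"constraint satisfaction: {score:.2f}")
```

Keeping the verifiable constraints in plain rules means the judge is only consulted where no exact check is possible, which is one plausible way to reduce the imprecision the abstract attributes to existing evaluation strategies.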

A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models

Computation and Language

Teaches computers to follow instructions better.

12 May 2025 1

90%

When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following

Computation and Language

Helps computers follow many commands better.

25 Sep 2025 1

89%

MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

Computation and Language

Tests AI that understands talking, seeing, and reading.

25 Jul 2025 2


Repos / Data Links

github.com

Page Count

28 pages
