Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
By: Bob Zhang, Haoran Li, Tao Zhang and more
Potential Business Impact:
Helps computers understand many pictures together.
Multimodal Large Language Models (MLLMs) now excel at visual grounding in single-image scenarios with textual references. However, their performance degrades in real-world applications that involve complex multi-image compositions and multi-modal instructions, revealing limitations in cross-image reasoning and generalization. To address these challenges, we adopt a Reinforcement Learning (RL) based post-training strategy to improve the reasoning of MLLMs in multi-image grounding tasks. Our approach begins by synthesizing high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) using low-rank adaptation (LoRA). This cold-start training stage enables the model to identify correct solutions. We then perform rejection sampling with the merged SFT model to curate high-quality RL data, and use rule-based RL to guide the model toward optimal reasoning paths. Extensive experiments demonstrate the effectiveness of our approach, which yields improvements of +9.04% on MIG-Bench, +6.37% on MC-Bench, and +4.98% on several out-of-domain reasoning grounding benchmarks compared to the SFT baseline. Furthermore, our method exhibits strong generalization in multi-image perception, with gains of +3.1% and +2.4% over the base model on the BLINK and MMIU benchmarks, respectively.
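The abstract does not spell out the reward rules or the rejection-sampling criterion, so the following is only a minimal sketch of how the two steps could look for a single predicted box. Everything in it is an assumption made for illustration: the `<think>`/`<answer>` output template, the IoU threshold of 0.5, the additive format-plus-accuracy reward, the rule of keeping only prompts that the SFT model solves in some rollouts but not all, and the names `iou`, `parse_box`, `rule_based_reward`, and `keep_for_rl`. It also ignores the choice of target image, which a multi-image setting would need to score as well.

```python
import re
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed pixel coordinates

# Assumed output template: reasoning in <think>, one box in <answer>.
ANSWER_RE = re.compile(
    r"<think>.*?</think>\s*<answer>\s*\[(.*?)\]\s*</answer>", re.DOTALL
)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def parse_box(response: str) -> Optional[Box]:
    """Return the predicted box, or None if the response breaks the template."""
    match = ANSWER_RE.fullmatch(response.strip())
    if match is None:
        return None
    try:
        nums = [float(x) for x in match.group(1).split(",")]
    except ValueError:
        return None
    if len(nums) != 4:
        return None
    return (nums[0], nums[1], nums[2], nums[3])


def rule_based_reward(response: str, gt_box: Box, iou_threshold: float = 0.5) -> float:
    """Hypothetical reward: 1.0 for a well-formed answer, plus 1.0 if IoU passes."""
    box = parse_box(response)
    if box is None:
        return 0.0
    accuracy = 1.0 if iou(box, gt_box) >= iou_threshold else 0.0
    return 1.0 + accuracy


def keep_for_rl(rewards: Sequence[float], max_reward: float = 2.0) -> bool:
    """Assumed rejection-sampling filter over several SFT-model rollouts.

    Keeps a prompt only when some rollouts reach the full reward and some
    do not, so the prompt is neither trivial nor unsolved.
    """
    solved = sum(r >= max_reward for r in rewards)
    return 0 < solved < len(rewards)
```

In use, each candidate prompt would be sampled several times from the merged SFT model, each response scored with `rule_based_reward`, and the prompt retained for the RL stage only when `keep_for_rl` returns true. A reward built purely from parsing and IoU needs no learned reward model, which is the usual motivation for rule-based RL.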
Similar Papers
Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start
Computation and Language
Teaches computers to understand pictures and words better.
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning
Machine Learning (CS)
Teaches AI to solve hard math and logic problems.
Interleaved Reasoning for Large Language Models via Reinforcement Learning
Computation and Language
Makes smart computers answer questions faster.