PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs
By: Zijing Wang, Yongkang Liu, Mingyang Wang, and more
Multimodal Large Language Models (MLLMs) rely on the strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this textual reasoning capability, undermining multimodal performance. To address this issue, we propose a training-free framework that mitigates the degradation. Through layer-wise vision token masking, we reveal a common three-stage pattern in MLLMs: early-stage modal separation, mid-stage modal alignment, and late-stage modal degradation. Guided by an analysis of MLLM behavior across these stages, we propose a plateau-guided model merging method that selectively injects base language model parameters into the MLLM. Experimental results with five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further reveals that merging shifts attention from diffuse, scattered patterns to focused localization on task-relevant visual regions. Our repository is available at https://github.com/wzj1718/PlaM.
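The abstract describes the core operation as selectively injecting base language model parameters into the MLLM at particular layers. The sketch below illustrates one way such layer-wise merging could be implemented; it is a minimal, hypothetical example, not the paper's released code. The function name `merge_layer_range`, the interpolation coefficient `alpha`, the layer-range selection, and the `model.layers.` key prefix are assumptions made for illustration.

```python
# Minimal illustrative sketch: linearly interpolate base-LLM weights into a
# chosen range of the MLLM's decoder layers, leaving all other parameters
# (vision encoder, projector, remaining layers) untouched.
# NOTE: the merge rule, alpha, and key-matching convention are assumptions,
# not the paper's actual method.
import torch

def merge_layer_range(mllm_state, base_state, layer_range, alpha=0.5,
                      layer_prefix="model.layers."):
    """Return a new state dict where parameters of the selected decoder layers
    are replaced by (1 - alpha) * mllm + alpha * base; everything else is kept."""
    lo, hi = layer_range
    merged = {}
    for name, mllm_param in mllm_state.items():
        if name.startswith(layer_prefix):
            layer_idx = int(name[len(layer_prefix):].split(".")[0])
            if lo <= layer_idx <= hi and name in base_state:
                merged[name] = (1 - alpha) * mllm_param + alpha * base_state[name]
                continue
        merged[name] = mllm_param.clone()
    return merged

# Toy usage with random tensors standing in for real checkpoints.
if __name__ == "__main__":
    mllm = {f"model.layers.{i}.mlp.weight": torch.randn(4, 4) for i in range(4)}
    base = {f"model.layers.{i}.mlp.weight": torch.randn(4, 4) for i in range(4)}
    out = merge_layer_range(mllm, base, layer_range=(2, 3), alpha=0.5)
    print(out["model.layers.0.mlp.weight"].equal(mllm["model.layers.0.mlp.weight"]))  # True: outside range, unchanged
    print(out["model.layers.2.mlp.weight"].equal(mllm["model.layers.2.mlp.weight"]))  # False: inside range, merged
```

In this sketch, the layer range stands in for the plateau region that the paper's analysis would identify; how that range is chosen is exactly what the plateau-guided procedure determines and is not reproduced here.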