OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging
By: Yongxian Wei, Runxi Cheng, Weike Jin, and more
Potential Business Impact:
Merging separately trained expert models into one multimodal model broadens capability while reducing training, storage, and serving costs.
Foundation models update slowly due to resource-intensive training, whereas domain-specific models evolve rapidly between releases. Model merging seeks to combine multiple expert models into a single, more capable model, reducing storage and serving costs while supporting decentralized development. Despite its potential, previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. Recently, Multimodal LLMs (MLLMs), which extend LLMs through large-scale multimodal training, have gained traction. However, the field lacks a model merging benchmark that clearly separates the tasks used for MLLM training and evaluation. In this paper, $\textbf{(i)}$ we introduce a model merging benchmark for MLLMs covering tasks such as VQA, Geometry, Chart, OCR, and Grounding, studying both LoRA and fully fine-tuned models. Moreover, we explore how model merging can combine different modalities (e.g., vision-language, audio-language, and video-language models), moving toward an Omni-language model. $\textbf{(ii)}$ We implement 10 model merging algorithms on the benchmark. Furthermore, we propose a novel method that removes noise from task vectors and robustly optimizes the merged vector based on a loss defined over task vector interactions, achieving an average performance gain of 2.48%. $\textbf{(iii)}$ We find that model merging offers a promising way to build improved MLLMs without requiring training data. Our results also demonstrate that combining multiple modalities outperforms any individual modality, highlighting their complementarity.
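To make the recipe concrete, below is a minimal sketch of task-vector merging with a noise-trimming step and a coefficient optimization against a simple interaction loss. The magnitude-quantile trimming rule, the squared-disagreement loss, and every function name here are illustrative assumptions; the abstract does not specify the paper's exact formulation.

```python
# Minimal sketch of task-vector merging: (1) form task vectors,
# (2) trim small-magnitude entries as an assumed noise-removal step,
# (3) optimize per-task scaling coefficients against a stand-in
# interaction loss. All choices here are illustrative, not the paper's.
import torch


def task_vector(finetuned, base):
    """Task vector: element-wise difference between fine-tuned and base weights."""
    return {k: finetuned[k] - base[k] for k in base}


def trim_noise(tv, keep_ratio=0.2):
    """Zero out all but the largest-magnitude entries of each tensor
    (assumed denoising step, in the spirit of magnitude trimming)."""
    out = {}
    for k, v in tv.items():
        thresh = torch.quantile(v.abs().flatten(), 1.0 - keep_ratio)
        out[k] = torch.where(v.abs() >= thresh, v, torch.zeros_like(v))
    return out


def merge(base, task_vectors, steps=100, lr=1e-2):
    """Learn one scaling coefficient per task vector by minimizing the
    squared disagreement between the merged vector and each task vector
    (a stand-in for the paper's interaction-based loss)."""
    lam = torch.ones(len(task_vectors), requires_grad=True)
    opt = torch.optim.Adam([lam], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = 0.0
        for k in base:
            merged = sum(l * tv[k] for l, tv in zip(lam, task_vectors))
            loss = loss + sum(((merged - tv[k]) ** 2).sum()
                              for tv in task_vectors)
        loss.backward()
        opt.step()
    # Apply the optimized merged vector on top of the base weights.
    with torch.no_grad():
        return {k: base[k] + sum(l * tv[k] for l, tv in zip(lam, task_vectors))
                for k in base}
```

Note that with all coefficients fixed at 1, the merge reduces to plain task arithmetic (base plus the sum of task vectors); the optimization step only reweights each expert's contribution.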
Similar Papers
Training-free LLM Merging for Multi-task Learning
Computation and Language
Merges LLMs to handle multiple tasks without additional training.
Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models
Computation and Language
Uses model merging to preserve language-only performance in multimodal models trained under developmentally plausible constraints.
Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs
Computer Vision and Pattern Recognition
Examines how MLLMs weight and integrate information from different modalities.