Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models
By: Rui Cai, Bangzheng Li, Xiaofei Wen, and more
Potential Business Impact:
Helps AI focus on important information, not distractions.
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across tasks, yet they often struggle to distinguish task-relevant from irrelevant signals, particularly in tasks like Visual Question Answering (VQA), leaving them susceptible to misleading or spurious inputs. We refer to this broader limitation as the Cross-Modality Competency Problem: the model's inability to fairly evaluate all modalities. The vulnerability becomes more evident in modality-specific tasks such as image classification or pure-text question answering, where models are expected to rely on a single modality; there, spurious information from the irrelevant modality often causes significant performance degradation. We call this failure Modality Interference, a concrete and measurable instance of the Cross-Modality Competency Problem. We further design a perturbation-based causal diagnostic experiment to verify and quantify this problem. To mitigate modality interference, we propose a novel fine-tuning framework for MLLMs that combines perturbation-based data augmentation, using both heuristic perturbations and adversarial perturbations generated via Projected Gradient Descent (PGD), with a consistency regularization strategy applied to model outputs on original and perturbed inputs. Experiments on multiple benchmark datasets (image-heavy, text-heavy, and VQA tasks) and multiple model families at different scales demonstrate significant improvements in robustness and cross-modality competency, indicating that our method strengthens unimodal reasoning while enhancing performance on multimodal tasks.
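The two training ingredients the abstract names can be illustrated with a minimal sketch. This is not the paper's released code: a toy linear model with a hand-derived gradient stands in for an MLLM, and a squared-error consistency penalty stands in for whatever divergence the authors apply to answer distributions. All function names (`toy_model`, `pgd_perturb`, `consistency_loss`) and hyperparameter values are illustrative assumptions.

```python
# Sketch of L-infinity PGD perturbation plus a consistency penalty.
# toy_model is a stand-in for an MLLM's scalar output; in the paper the
# perturbation would target the task-irrelevant modality's input.

def toy_model(x, w):
    """Stand-in model: a linear map from input features x to a scalar."""
    return sum(wi * xi for wi, xi in zip(w, x))

def pgd_perturb(x, w, y, eps=0.1, alpha=0.02, steps=10):
    """PGD: gradient-ascend the squared-error loss in delta, projecting
    each coordinate back into the L-infinity ball [-eps, eps]."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        pred = toy_model([xi + di for xi, di in zip(x, delta)], w)
        # For a linear model, d/dx_i of (pred - y)^2 is 2*(pred - y)*w_i
        grad = [2.0 * (pred - y) * wi for wi in w]
        # signed ascent step, then projection onto the eps-ball
        delta = [max(-eps, min(eps, di + alpha * (1.0 if g >= 0 else -1.0)))
                 for di, g in zip(delta, grad)]
    return delta

def consistency_loss(x, delta, w):
    """Penalize drift between outputs on clean and perturbed inputs
    (a squared difference here; a KL term over answer distributions
    would play this role for a real MLLM)."""
    clean = toy_model(x, w)
    perturbed = toy_model([xi + di for xi, di in zip(x, delta)], w)
    return (clean - perturbed) ** 2
```

During fine-tuning, the adversarial delta would be recomputed per batch and the consistency term added to the task loss, encouraging the model to give the same answer whether or not the irrelevant modality is perturbed.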
Similar Papers
Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs
CV and Pattern Recognition
Teaches AI to trust the right information.
Survey of Adversarial Robustness in Multimodal Large Language Models
CV and Pattern Recognition
Makes AI understand pictures and words safely.
MLLMs are Deeply Affected by Modality Bias
Artificial Intelligence
Makes computers understand pictures and words equally.