Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models

Published: December 26, 2025 | arXiv ID: 2512.21964v1

By: Dunyuan XU, Xikai Yang, Yaoqian Li and more

Potential Business Impact:

Makes AI doctors better at understanding messy medical images.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Medical Multi-modal Large Language Models (MLLMs) have shown promising clinical performance. However, their sensitivity to real-world input perturbations, such as imaging artifacts and textual errors, critically undermines their clinical applicability. The impact of such noise on medical MLLMs has not been systematically analyzed. Furthermore, while several works have investigated MLLM robustness in general domains, they primarily focus on the text modality and rely on costly fine-tuning. Such approaches are inadequate for the complex noise patterns and strict safety standards of medicine. To bridge this gap, this work systematically analyzes the impact of various perturbations on medical MLLMs across both visual and textual modalities. Building on our findings, we introduce a training-free Inherent-enhanced Multi-modal Calibration (IMC) framework that leverages MLLMs' inherent denoising capabilities, following the perceive-and-calibrate principle, for cross-modal robustness enhancement. For the visual modality, we propose Perturbation-aware Denoising Calibration (PDC), which uses the MLLM's own vision encoder to identify noise patterns and perform prototype-guided feature calibration. For text denoising, we design a Self-instantiated Multi-agent System (SMS) that exploits the MLLM's self-assessment capabilities to refine noisy text through a cooperative hierarchy of agents. We construct a benchmark containing 11 types of noise across both image and text modalities on 2 datasets. Experimental results demonstrate that our method achieves state-of-the-art performance across multiple modalities, showing potential to enhance MLLMs' robustness in real clinical scenarios.
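
The abstract does not spell out how prototype-guided feature calibration is computed, so the following is only an illustrative sketch of the general perceive-and-calibrate idea, not the paper's PDC. It assumes (hypothetically) that each noise type and the clean condition are summarized by a prototype feature vector, that the noise type is identified by cosine similarity to those prototypes, and that calibration shifts the feature by the offset between the clean prototype and the detected noise prototype. The function names, the `strength` parameter, and the toy prototypes are all invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def perceive(feature, prototypes):
    """Perceive step (assumed): name of the prototype closest to the feature."""
    return max(prototypes, key=lambda name: cosine(feature, prototypes[name]))

def calibrate(feature, prototypes, clean_key="clean", strength=1.0):
    """Calibrate step (assumed): shift the feature toward the clean prototype.

    Returns (detected_condition, calibrated_feature).
    """
    kind = perceive(feature, prototypes)
    if kind == clean_key:
        return kind, list(feature)  # nothing to correct
    # Offset from the detected noise prototype to the clean prototype.
    shift = [c - n for c, n in zip(prototypes[clean_key], prototypes[kind])]
    return kind, [f + strength * s for f, s in zip(feature, shift)]

# Toy, hand-made prototypes (hypothetical, not from the paper).
prototypes = {
    "clean": [1.0, 0.0, 0.0],
    "blur": [0.0, 1.0, 0.0],
    "motion": [0.0, 0.0, 1.0],
}

noisy_feature = [0.1, 0.9, 0.0]
kind, calibrated = calibrate(noisy_feature, prototypes)
print(kind, calibrated)
```

In this toy setup the calibrated feature ends up closer to the clean prototype than the noisy input was. No training is involved, which matches the training-free framing of the abstract, but the actual calibration rule in the paper may differ.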

Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models

Machine Learning (CS)

Helps AI focus on important information, not distractions.

26 May 2025 1

90%

Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs

CV and Pattern Recognition

Teaches AI to trust the right information.

28 Nov 2025 1

90%

On the Robustness of Medical Vision-Language Models: Are they Truly Generalizable?

CV and Pattern Recognition

Makes AI better at seeing medical pictures with flaws.

21 May 2025 0

Page Count

16 pages
