On the Risk of Misleading Reports: Diagnosing Textual Biases in Multimodal Clinical AI
By: David Restrepo, Ira Ktena, Maria Vakalopoulou and more
Potential Business Impact:
Tests whether medical AI trusts pictures or words more.
Clinical decision-making relies on the integrated analysis of medical images and the associated clinical reports. While Vision-Language Models (VLMs) can offer a unified framework for such tasks, they can exhibit strong biases toward one modality, frequently overlooking critical visual cues in favor of textual information. In this work, we introduce Selective Modality Shifting (SMS), a perturbation-based approach to quantify a model's reliance on each modality in binary classification tasks. By systematically swapping images or text between samples with opposing labels, we expose modality-specific biases. We assess six open-source VLMs (four generalist models and two fine-tuned for medical data) on two medical imaging datasets with distinct modalities: MIMIC-CXR (chest X-ray) and FairVLMed (scanning laser ophthalmoscopy). By assessing model performance and the calibration of every model in both unperturbed and perturbed settings, we reveal a marked dependency on text input, which persists despite the presence of complementary visual information. We also perform a qualitative attention-based analysis which further confirms that image content is often overshadowed by text details. Our findings highlight the importance of designing and evaluating multimodal medical models that genuinely integrate visual and textual cues, rather than relying on single-modality signals.
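The swapping idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Sample` type, the `shift_modality` and `accuracy` helpers, the toy `text_only_model`, and the synthetic data are all hypothetical names introduced here for illustration. The sketch also assumes that each perturbed sample keeps its original label, so a drop in accuracy after swapping one modality signals reliance on that modality.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Sample:
    image: str  # stand-in for pixel data (hypothetical)
    text: str   # stand-in for the clinical report (hypothetical)
    label: int  # binary label, 0 or 1


def shift_modality(samples: List[Sample], modality: str, seed: int = 0) -> List[Sample]:
    """Replace one modality of each sample with that of a random
    opposite-label donor; the other modality and the label are kept."""
    if modality not in ("image", "text"):
        raise ValueError("modality must be 'image' or 'text'")
    rng = random.Random(seed)
    by_label = {k: [s for s in samples if s.label == k] for k in (0, 1)}
    shifted = []
    for s in samples:
        donor = rng.choice(by_label[1 - s.label])
        shifted.append(replace(s, **{modality: getattr(donor, modality)}))
    return shifted


def accuracy(model: Callable[[Sample], int], samples: List[Sample]) -> float:
    """Fraction of samples whose prediction matches the (original) label."""
    return sum(model(s) == s.label for s in samples) / len(samples)


def text_only_model(s: Sample) -> int:
    """Toy classifier that ignores the image entirely (an extreme text bias)."""
    return int("abnormal" in s.text)


# Synthetic data in which both modalities agree with the label.
data = [
    Sample(
        image=f"img_{i}_label{i % 2}",
        text="abnormal finding" if i % 2 else "no finding",
        label=i % 2,
    )
    for i in range(8)
]

base_acc = accuracy(text_only_model, data)
text_shift_acc = accuracy(text_only_model, shift_modality(data, "text"))
image_shift_acc = accuracy(text_only_model, shift_modality(data, "image"))
print(base_acc, text_shift_acc, image_shift_acc)
```

For this deliberately text-biased toy model, swapping the text flips every prediction while swapping the image changes nothing. A model that integrated both modalities would be expected to degrade under either perturbation.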
Similar Papers
Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection
Computation and Language
Fixes computer understanding of mixed-up information.
Mixed Signals: Decoding VLMs' Reasoning and Underlying Bias in Vision-Language Conflict
Artificial Intelligence
Helps computers understand pictures and words better.
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
CV and Pattern Recognition
Computers trust words more than pictures.