MultiSHAP: A Shapley-Based Framework for Explaining Cross-Modal Interactions in Multimodal AI Models
By: Zhanliang Wang, Kai Wang
Potential Business Impact:
Shows how AI combines pictures and words to make decisions.
Multimodal AI models have achieved impressive performance on tasks that require integrating information from multiple modalities, such as vision and language. However, their "black-box" nature poses a major barrier to deployment in high-stakes applications where interpretability and trustworthiness are essential. Explaining cross-modal interactions in these models remains a major challenge. While existing explanation methods, such as attention maps and Grad-CAM, offer coarse insights into cross-modal relationships, they cannot precisely quantify the synergistic effects between modalities, and they are limited to open-source models with accessible internal weights. Here we introduce MultiSHAP, a model-agnostic interpretability framework that leverages the Shapley Interaction Index to attribute multimodal predictions to pairwise interactions between fine-grained visual and textual elements (such as image patches and text tokens), while remaining applicable to both open- and closed-source models. Our approach provides: (1) instance-level explanations that reveal synergistic and suppressive cross-modal effects for individual samples ("why the model makes a specific prediction on this input"), and (2) dataset-level explanations that uncover generalizable interaction patterns across samples ("how the model integrates information across modalities"). Experiments on public multimodal benchmarks confirm that MultiSHAP faithfully captures cross-modal reasoning mechanisms, while real-world case studies demonstrate its practical utility. Our framework is extensible beyond two modalities, offering a general solution for interpreting complex multimodal AI models.
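For readers who want the mechanics behind the abstract: the Shapley Interaction Index for a pair of players i and j (here, an image patch and a text token) drawn from a player set N of size n is

\phi_{ij} = \sum_{S \subseteq N \setminus \{i,j\}} \frac{|S|!\,(n-|S|-2)!}{(n-1)!}\,\Big( v(S \cup \{i,j\}) - v(S \cup \{i\}) - v(S \cup \{j\}) + v(S) \Big),

where v(S) is the model's scalar output with only the elements of S left unmasked. Because v needs only query access, the index applies to closed-source models. Below is a minimal Monte Carlo sketch of this quantity; the `value_fn` wrapper, function name, and parameters are illustrative assumptions for exposition, not MultiSHAP's actual API.

```python
import numpy as np

def shapley_interaction(value_fn, n_players, i, j, n_samples=200, seed=None):
    """Monte Carlo estimate of the Shapley Interaction Index phi_ij.

    value_fn: hypothetical callable taking a boolean mask of length
        n_players (True = patch/token kept, False = masked out) and
        returning the model's scalar prediction on the masked input.
        Only query access is required, so closed-source models work.
    i, j: indices of the image patch and text token being paired.
    """
    rng = np.random.default_rng(seed)
    others = np.array([p for p in range(n_players) if p not in (i, j)],
                      dtype=int)
    total = 0.0
    for _ in range(n_samples):
        # Sampling |S| uniformly from {0, ..., n-2}, then a uniform
        # subset S of that size, realizes the Shapley weight
        # |S|! (n-|S|-2)! / (n-1)! in expectation.
        k = rng.integers(0, len(others) + 1)
        S = rng.choice(others, size=k, replace=False)
        mask = np.zeros(n_players, dtype=bool)
        mask[S] = True
        v_s = value_fn(mask.copy())        # v(S)
        mask[i] = True
        v_si = value_fn(mask.copy())       # v(S + {i})
        mask[i], mask[j] = False, True
        v_sj = value_fn(mask.copy())       # v(S + {j})
        mask[i] = True
        v_sij = value_fn(mask.copy())      # v(S + {i, j})
        # Second-order difference: > 0 indicates synergy between the
        # patch and the token, < 0 indicates suppression.
        total += v_sij - v_si - v_sj + v_s
    return total / n_samples
```

A positive estimate marks a synergistic patch-token pair and a negative one a suppressive pair, matching the instance-level explanations described above. Note that each sample costs four model queries per pair, so estimating a full patch-by-token interaction matrix trades fidelity (n_samples, patch/token granularity) against query budget.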
Similar Papers
Rethinking Explainability in the Era of Multimodal AI
Artificial Intelligence
Explains how different data types work together.
Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models
Machine Learning (CS)
Helps readers understand how AI uses different kinds of information.
Here Comes the Explanation: A Shapley Perspective on Multi-contrast Medical Image Segmentation
Image and Video Processing
Helps doctors understand how AI finds brain tumors.