Fusion or Confusion? Multimodal Complexity Is Not All You Need

Published: December 28, 2025 | arXiv ID: 2512.22991v1

By: Tillmann Rheude, Roland Eils, Benjamin Wild

Deep learning architectures for multimodal learning have increased in complexity, driven by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study reimplementing 19 high-impact methods under standardized conditions, evaluating them across nine diverse datasets with up to 23 modalities, and testing their generalizability to new tasks beyond their original scope, including settings with missing modalities. We propose a Simple Baseline for Multimodal Learning (SimBaMM), a straightforward late-fusion Transformer architecture, and demonstrate that under standardized experimental conditions with rigorous hyperparameter tuning of all methods, more complex architectures do not reliably outperform SimBaMM. Statistical analysis indicates that more complex methods perform comparably to SimBaMM and frequently do not reliably outperform well-tuned unimodal baselines, especially in the small-data regime considered in many original studies. To support our findings, we include a case study of a recent multimodal learning method highlighting the methodological shortcomings in the literature. In addition, we provide a pragmatic reliability checklist to promote comparable, robust, and trustworthy future evaluations. In summary, we argue for a shift in focus: away from the pursuit of architectural novelty and toward methodological rigor.

Rethinking Multimodal Sentiment Analysis: A High-Accuracy, Simplified Fusion Architecture

Computation and Language

Helps computers understand feelings from talking, seeing, and hearing.

5 May 2025

Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs

CV and Pattern Recognition

Teaches AI to trust the right information.

28 Nov 2025

Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

Machine Learning (CS)

Helps understand how AI uses different information.

6 Aug 2025
