Quantifying Multimodal Imbalance: A GMM-Guided Adaptive Loss for Audio-Visual Learning
By: Zhaocheng Liu, Zhiwen Yu, Xiaoqing Liu
Potential Business Impact:
Helps computers understand sound and images together, even when one is less reliable than the other.
Current mainstream approaches to addressing multimodal imbalance focus primarily on architectural modifications and optimization-based strategies, often overlooking a quantitative analysis of the degree of imbalance between modalities. To address this gap, our work introduces a novel method for the quantitative analysis of multimodal imbalance, which in turn informs the design of a sample-level adaptive loss function.

We begin by defining the "Modality Gap" as the difference between the Softmax scores of different modalities (e.g., audio and visual) for the ground-truth class. Analysis of the Modality Gap distribution reveals that it can be effectively modeled by a bimodal Gaussian Mixture Model (GMM), whose two components are found to correspond to "modality-balanced" and "modality-imbalanced" samples, respectively. We then apply Bayes' theorem to compute the posterior probability of each sample belonging to each of these two distributions.

Informed by this quantitative analysis, we design a novel adaptive loss function with three objectives: (1) to minimize the overall Modality Gap; (2) to encourage the imbalanced sample distribution to shift towards the balanced one; and (3) to apply greater penalty weights to imbalanced samples. We employ a two-stage training strategy consisting of a warm-up phase followed by an adaptive training phase.

Experimental results demonstrate that our approach achieves state-of-the-art (SOTA) performance on the public CREMA-D and AVE datasets, attaining accuracies of $80.65\%$ and $70.90\%$, respectively, validating the effectiveness of the proposed methodology.
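The quantitative-analysis pipeline (Modality Gap, bimodal GMM fit, Bayes posteriors) can be illustrated with a short sketch. This is a minimal sketch, not the authors' implementation: it assumes scikit-learn's GaussianMixture on toy random logits, and the rule for deciding which GMM component is the "imbalanced" one (the component whose mean lies farther from zero) is our assumption, not stated in the abstract.

```python
# Minimal sketch of the analysis step: per-sample Modality Gap from each
# modality's softmax score on the ground-truth class, a two-component GMM
# fit, and Bayes posteriors P(component | gap) for every sample.
import numpy as np
from sklearn.mixture import GaussianMixture

def modality_gap(audio_logits, visual_logits, labels):
    """Gap = softmax_audio[y] - softmax_visual[y] for the true class y."""
    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    idx = np.arange(len(labels))
    pa = softmax(audio_logits)[idx, labels]
    pv = softmax(visual_logits)[idx, labels]
    return pa - pv

# Toy batch: 512 samples, 6 classes (CREMA-D has 6 emotion classes).
rng = np.random.default_rng(0)
audio_logits = rng.normal(size=(512, 6))
visual_logits = rng.normal(size=(512, 6))
labels = rng.integers(0, 6, size=512)

gap = modality_gap(audio_logits, visual_logits, labels).reshape(-1, 1)

# Fit the bimodal GMM; predict_proba returns the Bayes posterior
# P(component k | gap_i) for each sample.
gmm = GaussianMixture(n_components=2, random_state=0).fit(gap)
posteriors = gmm.predict_proba(gap)

# Assumed heuristic: the component whose mean is farther from zero gap
# is treated as the "modality-imbalanced" one.
imbalanced_k = int(np.argmax(np.abs(gmm.means_.ravel())))
p_imbalanced = posteriors[:, imbalanced_k]  # candidate per-sample weight
```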
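The abstract does not give the loss in closed form, so the following PyTorch sketch is only a hypothetical instantiation of the three stated objectives: an absolute-gap penalty for objective (1) and GMM-posterior-based sample weights for objective (3). Objective (2), shifting the imbalanced distribution towards the balanced one, is not separately modeled here. The name adaptive_loss and the weight lam are placeholders we introduce for illustration.

```python
import torch
import torch.nn.functional as F

def adaptive_loss(audio_logits, visual_logits, fused_logits,
                  labels, p_imbalanced, lam=0.1):
    # Per-sample softmax score of the ground-truth class for each modality.
    pa = F.softmax(audio_logits, dim=1).gather(1, labels.unsqueeze(1)).squeeze(1)
    pv = F.softmax(visual_logits, dim=1).gather(1, labels.unsqueeze(1)).squeeze(1)

    # Objective (1): shrink the absolute Modality Gap.
    gap_penalty = (pa - pv).abs()

    # Base task loss on the fused prediction, kept per-sample.
    ce = F.cross_entropy(fused_logits, labels, reduction="none")

    # Objective (3): heavier penalty weights for samples the GMM
    # posterior marks as imbalanced (p_imbalanced is detached, no grad).
    weights = 1.0 + p_imbalanced

    return (weights * (ce + lam * gap_penalty)).mean()
```

In the two-stage strategy this would slot in after warm-up: train with plain cross-entropy first, then fit the GMM on the resulting Modality Gap distribution and switch to the adaptive loss, optionally refitting the GMM as training proceeds.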
Similar Papers
Contribution-Guided Asymmetric Learning for Robust Multimodal Fusion under Imbalance and Noise
Multimedia
Helps computers understand mixed information better.
Revisit Modality Imbalance at the Decision Layer
Machine Learning (CS)
Fixes AI that favors one sense over another.