DaQ-MSA: Denoising and Qualifying Diffusion Augmentations for Multimodal Sentiment Analysis
By: Jiazhang Liang, Jianheng Dai, Miaosen Luo, and more
Potential Business Impact:
Makes AI understand feelings in videos better.
Multimodal large language models (MLLMs) have demonstrated strong performance on vision-language tasks, yet their effectiveness on multimodal sentiment analysis remains constrained by the scarcity of high-quality training data, which limits accurate multimodal understanding and generalization. To alleviate this bottleneck, we leverage diffusion models to perform semantics-preserving augmentation on the video and audio modalities, expanding the multimodal training distribution. However, increasing data quantity alone is insufficient: diffusion-generated samples exhibit substantial quality variation, and noisy augmentations may degrade performance. We therefore propose DaQ-MSA (Denoising and Qualifying Diffusion Augmentations for Multimodal Sentiment Analysis), which introduces a quality scoring module to evaluate the reliability of augmented samples and assign adaptive training weights. Down-weighting low-quality samples while emphasizing high-fidelity ones enables more stable learning. By integrating the generative capability of diffusion models with the semantic understanding of MLLMs, our approach provides a robust and generalizable automated augmentation strategy for training MLLMs without any human annotation or additional supervision.
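To make the quality-weighting idea concrete, here is a minimal sketch of a training step in which augmented samples are weighted by a reliability score before the loss is reduced. This is an illustration under assumptions, not the paper's actual implementation: the names `sentiment_model` and `quality_scorer` are hypothetical, and we assume the scorer maps each sample to a score in [0, 1] while original (non-augmented) samples keep full weight.

```python
# Sketch of quality-weighted training, assuming a hypothetical scorer that
# assigns each diffusion-augmented sample a reliability score in [0, 1].
# `sentiment_model` and `quality_scorer` are illustrative placeholders.
import torch
import torch.nn.functional as F

def weighted_step(sentiment_model, quality_scorer, batch, optimizer):
    """One training step that down-weights low-quality augmentations."""
    inputs, labels, is_augmented = batch  # is_augmented: bool mask, shape (B,)

    with torch.no_grad():
        # Score the reliability of each sample; real (non-augmented)
        # samples keep full weight 1.0.
        scores = quality_scorer(inputs)  # shape (B,), values in [0, 1]
        weights = torch.where(is_augmented, scores, torch.ones_like(scores))

    logits = sentiment_model(inputs)
    per_sample_loss = F.cross_entropy(logits, labels, reduction="none")
    # Adaptive weighting: emphasize high-fidelity samples, suppress noise.
    loss = (weights * per_sample_loss).sum() / weights.sum().clamp(min=1e-8)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key design choice this sketch captures is that low-quality augmentations are not discarded outright but softly suppressed, so the model still sees the expanded training distribution while noisy samples contribute less to the gradient.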
Similar Papers
Confidence-Aware Self-Distillation for Multimodal Sentiment Analysis with Incomplete Modalities
Machine Learning (CS)
Helps computers understand feelings even with missing info.
MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis
Computer Vision and Pattern Recognition
Helps computers understand feelings from faces, voices, words.
Improving Multimodal Sentiment Analysis via Modality Optimization and Dynamic Primary Modality Selection
Computer Vision and Pattern Recognition
Makes computers understand feelings from videos better.