MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding

Published: July 7, 2025 | arXiv ID: 2507.04635v1

By: Zhicheng Zhang, Wuyou Xia, Chenxi Zhao and more

Potential Business Impact:

Helps computers understand pictures and words better.

Business Areas:

Natural Language Processing Artificial Intelligence, Data and Analytics, Software

Multimodal large language models (MLLMs) have recently shown strong capacity for integrating data across multiple modalities, empowered by a generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning and pay less attention to how multimodal tokens are mixed through attention, which poses challenges in high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decay of attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), which simultaneously conducts inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped to duplex modality spaces based on the basis vectors, enabling interaction between the visual and language modalities. Further, the correctness of attention scores is ensured through adaptive masked attention, which enhances the model's flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks. Source code and demo are available at https://zzcheng.top/MODA.
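The abstract mentions masked attention with masking patterns that can differ per modality. As a rough illustration of that general idea only, the sketch below runs ordinary scaled dot-product attention under a mask built from modality labels. It is not the MODA implementation: the function names `build_modality_mask` and `masked_attention`, the `allowed_pairs` parameter, and the toy tokens are all assumptions made for this example, and the paper's alignment step and adaptive masking are not modelled.

```python
import math

def build_modality_mask(modalities, allowed_pairs):
    """Hypothetical helper: mask[i][j] is True when query token i may attend to key token j."""
    return [[(mq, mk) in allowed_pairs for mk in modalities] for mq in modalities]

def masked_attention(queries, keys, values, mask):
    """Plain scaled dot-product attention; masked-out keys receive zero weight."""
    dim = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Disallowed positions get a score of -inf so their softmax weight is 0.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) if mask[i][j] else -math.inf
            for j, k in enumerate(keys)
        ]
        top = max(scores)
        if top == -math.inf:
            raise ValueError(f"token {i} has no key it is allowed to attend to")
        # Subtract the row maximum before exponentiating for numerical stability.
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
        all_weights.append(weights)
    return outputs, all_weights

# Toy input (assumed): two visual tokens followed by one text token.
modalities = ["visual", "visual", "text"]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Inner-modal pattern: each token attends only within its own modality.
inner_only = {("visual", "visual"), ("text", "text")}
mask = build_modality_mask(modalities, inner_only)
outputs, weights = masked_attention(tokens, tokens, tokens, mask)
```

Adding `("text", "visual")` to `allowed_pairs` would let the text token also attend to the visual tokens, which is one simple way to express a different masking pattern for each modality in this toy setup.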

MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

CV and Pattern Recognition

Helps computers understand pictures better with words.

2 Jun 2025 1

88%

ADMC: Attention-based Diffusion Model for Missing Modalities Feature Completion

Artificial Intelligence

Helps computers understand feelings even with missing clues.

8 Jul 2025 1

88%

Dynamic Multimodal Sentiment Analysis: Leveraging Cross-Modal Attention for Enabled Classification

Computation and Language

Helps computers understand feelings from voice, face, and words.

14 Jan 2025 0

Page Count

12 pages
