Hybrid Hypergraph Networks for Multimodal Sequence Data Classification
By: Feng Xu, Hui Wang, Yuting Huang and more
Potential Business Impact:
Helps computers understand videos with sound better.
Modeling temporal multimodal data poses significant challenges in classification tasks, particularly in capturing long-range temporal dependencies and intricate cross-modal interactions. Audiovisual data, a representative example, is inherently characterized by strict temporal order and diverse modalities. Effectively leveraging this temporal structure is essential for understanding both intra-modal dynamics and inter-modal correlations. However, most existing approaches treat each modality independently and rely on shallow fusion strategies, which overlook temporal dependencies and hinder the model's ability to represent complex structural relationships. To address this limitation, we propose the hybrid hypergraph network (HHN), a novel framework that models temporal multimodal data via a segmentation-first, graph-later strategy. HHN splits sequences into timestamped segments, which serve as nodes in a heterogeneous graph. Intra-modal structures are captured via hyperedges guided by a maximum entropy difference criterion, which enhances node heterogeneity and structural discrimination; hypergraph convolution then extracts high-order dependencies. Inter-modal links are established through temporal alignment, with graph attention used for semantic fusion. HHN achieves state-of-the-art (SOTA) results on four multimodal datasets, demonstrating its effectiveness in complex classification tasks.
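The segmentation-first, graph-later idea for a single modality can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes mean-pooled fixed-length segments as node features, and it uses the widely known normalized hypergraph convolution, Dv^(-1/2) H W De^(-1) H^T Dv^(-1/2) X Theta, which may differ from the layer used in HHN. The abstract does not specify the maximum entropy difference criterion, so a sliding-window hyperedge builder stands in for it here purely as a placeholder. All function names (`segment`, `sliding_window_incidence`, `hypergraph_conv`) are hypothetical.

```python
import numpy as np


def segment(sequence, seg_len):
    """Split a (T, d) sequence into non-overlapping segments of seg_len steps.

    Returns mean-pooled segment features (n, d) and each segment's start
    timestamp (n,). Mean pooling is an assumption made for illustration.
    """
    n = sequence.shape[0] // seg_len
    trimmed = sequence[: n * seg_len]          # drop the incomplete tail
    feats = trimmed.reshape(n, seg_len, -1).mean(axis=1)
    starts = np.arange(n) * seg_len
    return feats, starts


def sliding_window_incidence(n_nodes, window):
    """Placeholder hyperedge builder (NOT the paper's entropy criterion).

    Hyperedge j joins the `window` consecutive segments j .. j+window-1.
    Returns an incidence matrix H of shape (n_nodes, n_nodes - window + 1).
    """
    n_edges = n_nodes - window + 1
    H = np.zeros((n_nodes, n_edges))
    for j in range(n_edges):
        H[j : j + window, j] = 1.0
    return H


def hypergraph_conv(X, H, Theta, edge_w=None):
    """One normalized hypergraph convolution layer followed by ReLU."""
    n_edges = H.shape[1]
    w = np.ones(n_edges) if edge_w is None else np.asarray(edge_w, dtype=float)
    dv = H @ w                 # weighted node degrees
    de = H.sum(axis=0)         # hyperedge degrees (nodes per hyperedge)

    dv_inv_sqrt = np.zeros_like(dv)
    dv_inv_sqrt[dv > 0] = dv[dv > 0] ** -0.5
    de_inv = np.zeros_like(de)
    de_inv[de > 0] = 1.0 / de[de > 0]

    Hn = dv_inv_sqrt[:, None] * H              # Dv^(-1/2) H
    A = (Hn * (w * de_inv)[None, :]) @ Hn.T    # node-to-node propagation matrix
    return np.maximum(A @ X @ Theta, 0.0)


# Toy single-modality example: 12 time steps, 2 features per step.
seq = np.arange(24, dtype=float).reshape(12, 2)
feats, starts = segment(seq, seg_len=3)        # 4 timestamped segment nodes
H = sliding_window_incidence(len(feats), window=2)
out = hypergraph_conv(feats, H, Theta=np.eye(2))
```

In a multimodal setting, one such hypergraph would be built per modality, and the `starts` timestamps would be the natural key for aligning segments across modalities before an attention-based fusion step; that part is omitted here because the abstract gives no further detail.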
Similar Papers
Simple and Efficient Heterogeneous Temporal Graph Neural Network
Machine Learning (CS)
Makes computers understand changing online connections faster.
Temporally Heterogeneous Graph Contrastive Learning for Multimodal Acoustic Event Classification
Sound
Helps computers understand sounds and sights together.
Representation Learning with Mutual Influence of Modalities for Node Classification in Multi-Modal Heterogeneous Networks
Machine Learning (CS)
Helps computers understand online groups better.