GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection
By: Yu Wang, Juhyung Ha, Frangil M. Ramirez, and more
Potential Business Impact:
Identifies who is talking in videos more accurately.
Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP) to suppress spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8% mAP (+9.4%), 86.1% mAP (+2.9%), and 96.1% mAP (+0.5%) on Ego4D-ASD, UniTalk, and WASD benchmarks, respectively, and delivering competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments demonstrate the generalization of our model, while comprehensive ablations show the complementary benefits of each component.
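The gated injection idea in the abstract can be pictured with a minimal PyTorch sketch: contextual features from one modality (e.g., audio) are cross-attended by the other (e.g., video) and added back through a residual connection scaled by a learnable gate conditioned on both streams. Module names and hyperparameters below (GatedInjection, d_model, n_heads) are illustrative assumptions, not the authors' HiGate implementation.

```python
# Minimal sketch of a bimodally-conditioned gated cross-modal injection block.
# Names and sizes are assumptions for illustration, not the paper's actual code.
import torch
import torch.nn as nn


class GatedInjection(nn.Module):
    """Injects contextual features from a source modality into a target modality,
    scaled by a per-channel gate computed from both modalities."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # The gate sees target and cross-modal context together -> values in [0, 1].
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.norm = nn.LayerNorm(d_model)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (B, T, D) e.g. visual tokens; source: (B, S, D) e.g. audio tokens.
        context, _ = self.cross_attn(query=target, key=source, value=source)
        g = self.gate(torch.cat([target, context], dim=-1))  # (B, T, D) gate
        return self.norm(target + g * context)               # gated residual injection


# Toy usage: alternate injection directions at successive backbone depths so each
# modality is progressively enriched by the other (the "multi-depth fusion" idea).
if __name__ == "__main__":
    B, T, S, D = 2, 16, 32, 256
    video, audio = torch.randn(B, T, D), torch.randn(B, S, D)
    audio_to_video, video_to_audio = GatedInjection(D), GatedInjection(D)
    video = audio_to_video(video, audio)  # audio context injected into video stream
    audio = video_to_audio(audio, video)  # video context injected into audio stream
    print(video.shape, audio.shape)
```

Stacking such blocks at several layers of the Transformer backbone is one plausible reading of the progressive, multi-depth fusion the abstract describes; the paper's HiGate decoder and the MAL/OPP auxiliary losses add further machinery beyond this sketch.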
Similar Papers
DGFNet: End-to-End Audio-Visual Source Separation Based on Dynamic Gating Fusion
Sound
Separates different sound sources in videos more cleanly.
AG-Fusion: Adaptive Gated Multimodal Fusion for 3D Object Detection in Complex Scenes
CV and Pattern Recognition
Helps self-driving cars see better in bad weather.
Ensembling Synchronisation-based and Face-Voice Association Paradigms for Robust Active Speaker Detection in Egocentric Recordings
Multimedia
Helps computers find who's talking in messy videos.