D$^{2}$Stream: Decoupled Dual-Stream Temporal-Speaker Interaction for Audio-Visual Speaker Detection
By: Junhao Xiao, Shun Feng, Zhiyu Wu, and more
Audio-visual speaker detection aims to identify the active speaker in videos by leveraging complementary audio and visual cues. Existing methods often suffer from computational inefficiency or suboptimal performance because they jointly model temporal and speaker interactions. We propose D$^{2}$Stream, a decoupled dual-stream framework that separates cross-frame temporal modeling from within-frame speaker discrimination. Audio and visual features are first aligned via cross-modal attention, then fed into two lightweight streams: a Temporal Interaction Stream captures long-range temporal dependencies, while a Speaker Interaction Stream models per-frame inter-person relationships. The temporal and relational features extracted by the two streams interact via cross-attention to enrich representations. A lightweight Voice Gate module further mitigates false positives from non-speech facial movements. On AVA-ActiveSpeaker, D$^{2}$Stream achieves a new state-of-the-art of 95.6% mAP, with an 80% reduction in computation compared to GNN-based models and 30% fewer parameters than attention-based alternatives, while also generalizing well to Columbia ASD. Source code is available at https://anonymous.4open.science/r/D2STREAM.
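To make the decoupled dual-stream idea from the abstract concrete, below is a minimal PyTorch sketch of such an architecture. All module names, dimensions, and wiring (CrossModalAlign, DualStreamASD, the gate design) are assumptions for illustration only; the paper's actual D$^{2}$Stream implementation is in the linked repository and may differ substantially.

```python
# Minimal sketch of a decoupled dual-stream active-speaker model.
# Assumed design for illustration; not the authors' implementation.
import torch
import torch.nn as nn


class CrossModalAlign(nn.Module):
    """Aligns audio and visual features via cross-modal attention (assumed design)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.v_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):
        # audio: (B, T, D), visual: (B, T, D) -- one candidate-face track at a time
        v, _ = self.v_from_a(visual, audio, audio)   # visual attends to audio
        a, _ = self.a_from_v(audio, visual, visual)  # audio attends to visual
        return a + audio, v + visual


class DualStreamASD(nn.Module):
    """Decoupled dual-stream block: temporal modeling across frames and
    speaker modeling within frames, fused by cross-attention."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.align = CrossModalAlign(dim, heads)
        self.temporal = nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True)
        self.speaker = nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True)
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Hypothetical "voice gate": a scalar gate driven by audio features
        # to suppress false positives from non-speech facial motion.
        self.voice_gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.head = nn.Linear(dim, 1)

    def forward(self, audio, visual):
        # audio: (B, T, D); visual: (B, S, T, D) with S candidate speakers per frame.
        B, S, T, D = visual.shape
        a, v = self.align(audio.repeat_interleave(S, 0), visual.reshape(B * S, T, D))
        # Temporal Interaction Stream: attention over frames within each speaker track.
        t_feat = self.temporal(v)                                        # (B*S, T, D)
        # Speaker Interaction Stream: attention over speakers within each frame.
        s_feat = self.speaker(
            v.reshape(B, S, T, D).permute(0, 2, 1, 3).reshape(B * T, S, D)
        )
        s_feat = s_feat.reshape(B, T, S, D).permute(0, 2, 1, 3).reshape(B * S, T, D)
        # Cross-attention between the two streams enriches the representation.
        fused, _ = self.fuse(t_feat, s_feat, s_feat)
        # Gate the per-frame score by a speech-activity signal derived from audio.
        gate = self.voice_gate(a)                                        # (B*S, T, 1)
        logits = self.head(fused) * gate                                 # (B*S, T, 1)
        return logits.reshape(B, S, T)


if __name__ == "__main__":
    model = DualStreamASD(dim=128)
    audio = torch.randn(2, 16, 128)        # batch of 2 clips, 16 frames
    visual = torch.randn(2, 3, 16, 128)    # 3 candidate faces per frame
    print(model(audio, visual).shape)      # torch.Size([2, 3, 16])
```

The key design point mirrored here is the decoupling: temporal attention runs over frames for a fixed speaker, speaker attention runs over candidates within a fixed frame, and the two lightweight streams only meet in a final cross-attention step rather than in one joint (and more expensive) spatio-temporal block.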
Similar Papers
DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis
Multimedia
Generates a soundtrack for video by jointly synthesizing speech and background audio.
AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition
Multimedia
Improves the robustness of audio-visual speech recognition in noisy conditions.
GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection
CV and Pattern Recognition
Improves active speaker detection with hierarchical gated cross-modal fusion.