Cross-modal Causal Relation Alignment for Video Question Grounding
By: Weixing Chen, Yang Liu, Binglin Chen and more
Potential Business Impact:
Helps computers find video clips for answers.
Video question grounding (VideoQG) requires a model to answer a question and simultaneously infer the video segments that support the answer. However, existing VideoQG methods usually suffer from spurious cross-modal correlations, and so fail to identify the dominant visual scenes that align with the intended question. Moreover, vision-language models generalize unfaithfully and lack robustness on challenging downstream tasks such as VideoQG. In this work, we propose a novel VideoQG framework named Cross-modal Causal Relation Alignment (CRA) to eliminate spurious correlations and improve the causal consistency between question answering and video temporal grounding. CRA involves three essential components: i) a Gaussian Smoothing Grounding (GSG) module, which estimates the time interval via cross-modal attention de-noised by an adaptive Gaussian filter; ii) Cross-Modal Alignment (CMA), which enhances weakly supervised VideoQG through bidirectional contrastive learning between estimated video segments and QA features; and iii) an Explicit Causal Intervention (ECI) module for multimodal deconfounding, which applies front-door intervention for vision and back-door intervention for language. Extensive experiments on two VideoQG datasets demonstrate the superiority of CRA in discovering visually grounded content and achieving robust question reasoning. Code is available at https://github.com/WissingChen/CRA-GQA.
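To illustrate the idea behind Gaussian-smoothed grounding, the following is a minimal sketch and not the authors' implementation. It assumes per-frame cross-modal attention scores are already available as a plain list, uses a fixed `sigma` where the paper describes an adaptive filter, and picks the interval with a simple relative threshold around the smoothed peak. The function names, the `rel_threshold` parameter, and the sample scores are all hypothetical.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Return a normalized 1-D Gaussian kernel (weights sum to 1)."""
    if radius is None:
        radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(scores, sigma):
    """Convolve per-frame scores with a Gaussian, renormalizing at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    out = []
    for t in range(len(scores)):
        acc, norm = 0.0, 0.0
        for k, w in enumerate(kernel):
            idx = t + k - radius
            if 0 <= idx < len(scores):
                acc += w * scores[idx]
                norm += w
        out.append(acc / norm)
    return out


def estimate_interval(scores, sigma=1.0, rel_threshold=0.5):
    """Grow a contiguous frame span outward from the smoothed peak.

    A neighboring frame joins the span while its smoothed score is at least
    rel_threshold * peak score. Returns inclusive (start, end) frame indices.
    Assumption: this thresholding rule is illustrative, not taken from the paper.
    """
    smoothed = smooth(scores, sigma)
    peak = max(range(len(smoothed)), key=smoothed.__getitem__)
    cutoff = rel_threshold * smoothed[peak]
    start = peak
    while start > 0 and smoothed[start - 1] >= cutoff:
        start -= 1
    end = peak
    while end < len(smoothed) - 1 and smoothed[end + 1] >= cutoff:
        end += 1
    return start, end


# Hypothetical attention scores: an isolated noisy spike at frame 2 and a
# sustained region of moderately high attention over frames 5 to 8.
attention = [0.02, 0.03, 0.60, 0.02, 0.05, 0.30, 0.45, 0.40, 0.35, 0.03]
start, end = estimate_interval(attention, sigma=1.0)
print(f"estimated interval: frames {start} to {end}")
```

In this toy input the raw attention maximum sits on the isolated spike at frame 2, whereas smoothing shifts the peak into the sustained region, so the estimated span covers that region and excludes the spike. This is the kind of de-noising effect the abstract attributes to the Gaussian filter, shown here only under the assumptions stated above.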
Similar Papers
Context Selection and Rewriting for Video-based Educational Question Generation
Computation and Language
Creates smart questions from real class videos.
Bridging Vision Language Models and Symbolic Grounding for Video Question Answering
CV and Pattern Recognition
Helps computers understand videos better by seeing relationships.
Question-Aware Gaussian Experts for Audio-Visual Question Answering
CV and Pattern Recognition
Helps computers answer questions about videos better.