Mitigating Semantic Collapse in Partially Relevant Video Retrieval
By: WonJun Moon, MinSeok Jung, Gilhan Park and more
Potential Business Impact:
Finds videos in which only part of the content matches your search.
Partially Relevant Video Retrieval (PRVR) seeks videos in which only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all other pairs as negatives, ignoring the rich semantic variation both within a single video and across different videos. As a result, the embeddings of queries and video segments that describe distinct events within the same video collapse together, while the embeddings of semantically similar queries and segments from different videos are pushed apart. This limits retrieval performance when videos contain multiple, diverse events. This paper addresses these problems, which it terms semantic collapse, in both the text and video embedding spaces. We first introduce Text Correlation Preservation Learning, which preserves the semantic relationships among text queries that are encoded by the foundation model. To address collapse in video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations across temporal scales. We then introduce order-preserving token merging and adaptive CBVA, which enhance alignment by producing video segments that are internally coherent yet mutually distinctive. Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy.
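To make the idea of preserving text correlations concrete, here is a minimal sketch of one way such an objective could look. It is an illustration under stated assumptions, not the paper's actual loss: the abstract does not give the formulation, so the choice of cosine similarity, the mean squared error penalty, and the function names (`cosine`, `similarity_matrix`, `correlation_preservation_loss`) are all hypothetical. The sketch compares the pairwise similarity structure of the learned query embeddings against that of the frozen foundation-model embeddings and penalizes any drift between the two.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, 1e-12)  # guard against zero vectors

def similarity_matrix(embeddings):
    """Pairwise cosine similarities among a list of embeddings."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]

def correlation_preservation_loss(learned, frozen):
    """Mean squared difference between two pairwise-similarity matrices.

    learned: query embeddings from the model being trained (assumed input).
    frozen:  embeddings of the same queries from the frozen foundation
             model (assumed input).
    The MSE form is an assumption made for illustration only.
    """
    if len(learned) != len(frozen) or not learned:
        raise ValueError("need two non-empty embedding lists of equal length")
    s_learned = similarity_matrix(learned)
    s_frozen = similarity_matrix(frozen)
    n = len(learned)
    total = sum(
        (s_learned[i][j] - s_frozen[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    return total / (n * n)

# Toy example: three queries whose frozen embeddings are clearly distinct.
frozen = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Collapsed embeddings: every query maps to the same direction.
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
# Rotated embeddings: different coordinates, same pairwise relationships.
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]

loss_collapsed = correlation_preservation_loss(collapsed, frozen)
loss_rotated = correlation_preservation_loss(rotated, frozen)
```

In this toy setting the collapsed embeddings incur a positive penalty, while the rotated ones incur essentially none, because only the relationships between queries are constrained and not their absolute positions in the embedding space.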