Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval
By: Arun Reddy, Alexander Martin, Eugene Yang, and more
Potential Business Impact:
Finds videos from written descriptions.
In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon three main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations yield performance gains on common text-to-video retrieval benchmarks over other bi-encoder methods.
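To make the token-wise interaction concrete: in ColBERT-style late interaction, each query token takes its maximum similarity against the candidate's tokens, and those maxima are summed into a single score. Below is a minimal, hypothetical PyTorch sketch of that MaxSim interaction applied at both the frame level (spatial tokens) and the temporally contextualized video level, trained with a SigLIP-style sigmoid loss on each score matrix. The tensor shapes, the maxsim and sigmoid_contrastive_loss helpers, and the additive combination of the two losses are illustrative assumptions, not the paper's exact formulation.

# Hypothetical sketch of ColBERT-style late interaction for video,
# under the assumptions stated above.
import torch
import torch.nn.functional as F

def maxsim(q, v):
    # q: (B, Nq, D) L2-normalized query token embeddings
    # v: (B, Nv, D) L2-normalized visual token embeddings
    # Returns (B, B) pairwise query-video scores: for every query-video
    # pair, each query token takes its max similarity over visual tokens,
    # and the maxima are summed over query tokens.
    sim = torch.einsum("qnd,vmd->qvnm", q, v)  # (B, B, Nq, Nv)
    return sim.max(dim=-1).values.sum(dim=-1)

def sigmoid_contrastive_loss(scores, temperature=0.07, bias=0.0):
    # SigLIP-style pairwise sigmoid loss over a (B, B) score matrix:
    # matched pairs (the diagonal) get label +1, all others -1.
    b = scores.size(0)
    labels = 2.0 * torch.eye(b, device=scores.device) - 1.0
    logits = scores / temperature + bias
    return -F.logsigmoid(labels * logits).mean()

# Toy usage with random features (shapes are assumptions):
B, Nq, D = 4, 32, 512
Nf, Nv = 12 * 49, 12  # per-frame patch tokens vs. temporally pooled tokens
q = F.normalize(torch.randn(B, Nq, D), dim=-1)
frame_tokens = F.normalize(torch.randn(B, Nf, D), dim=-1)
video_tokens = F.normalize(torch.randn(B, Nv, D), dim=-1)

s_spatial = maxsim(q, frame_tokens)    # query vs. spatial frame tokens
s_temporal = maxsim(q, video_tokens)   # query vs. temporal video tokens
# "Dual" objective: one sigmoid loss per interaction level, summed.
loss = sigmoid_contrastive_loss(s_spatial) + sigmoid_contrastive_loss(s_temporal)

At retrieval time, only the summed MaxSim scores are needed, so video token embeddings can be precomputed and indexed offline, which is what keeps this family of bi-encoder methods efficient.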
Similar Papers
Hybrid-Tower: Fine-grained Pseudo-query Interaction and Generation for Text-to-Video Retrieval
CV and Pattern Recognition
Retrieves videos from text queries faster and more accurately.
Leveraging Auxiliary Information in Text-to-Video Retrieval: A Review
CV and Pattern Recognition
Finds the right video from text descriptions.
Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions
CV and Pattern Recognition
Improves video retrieval by reading frame-level captions.