SGCap: Decoding Semantic Group for Zero-shot Video Captioning
By: Zeyu Pan, Ping Li, Wenxiao Wang
Potential Business Impact:
Lets computers describe any video without practice.
Zero-shot video captioning aims to generate sentences that describe videos without training the model on video-text pairs, a setting that remains underexplored. Existing zero-shot image captioning methods typically adopt a text-only training paradigm, where a language decoder reconstructs single-sentence embeddings obtained from CLIP. However, directly extending them to the video domain is suboptimal, as applying average pooling over all frames neglects temporal dynamics. To address this challenge, we propose a Semantic Group Captioning (SGCap) method for zero-shot video captioning. In particular, it develops the Semantic Group Decoding (SGD) strategy to employ multi-frame information while explicitly modeling inter-frame temporal relationships. Furthermore, existing zero-shot captioning methods, which rely on cosine similarity for sentence retrieval and reconstruct the description under the supervision of a single frame-level caption, fail to provide sufficient video-level supervision. To alleviate this, we introduce two key components: the Key Sentences Selection (KSS) module and the Probability Sampling Supervision (PSS) module. The two modules construct semantically diverse sentence groups that model temporal dynamics and guide the model to capture inter-sentence causal relationships, thereby enhancing its generalization ability to video captioning. Experimental results on several benchmarks demonstrate that SGCap significantly outperforms previous state-of-the-art zero-shot alternatives and even achieves performance competitive with fully supervised ones. Code is available at https://github.com/mlvccn/SGCap_Video.
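The abstract's remark that average pooling over frames neglects temporal dynamics can be illustrated with a small sketch. This is not the authors' code: the frame embeddings are random stand-in vectors rather than CLIP features, and the helper names `mean_pool` and `cosine` are hypothetical. It only shows that a mean-pooled video embedding is the same whether the frames are played forward or backward, so frame order is lost.

```python
import math
import random


def mean_pool(frames):
    """Average a list of equal-length frame embeddings into one vector."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Assumption: random vectors stand in for per-frame image embeddings.
rng = random.Random(0)
frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]

# The same clip played backward: identical frames, opposite temporal order.
reversed_frames = frames[::-1]

pooled_forward = mean_pool(frames)
pooled_backward = mean_pool(reversed_frames)

# Mean pooling is permutation-invariant, so the two pooled vectors coincide
# (up to floating-point rounding) and their cosine similarity is about 1.
similarity = cosine(pooled_forward, pooled_backward)
print(f"cosine(forward, backward) = {similarity:.6f}")
```

Because the pooled vectors match, any decoder conditioned only on the pooled embedding cannot tell the two orderings apart, which is the limitation that motivates decoding from a group of per-frame signals instead.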
Similar Papers
Zero Shot Domain Adaptive Semantic Segmentation by Synthetic Data Generation and Progressive Adaptation
CV and Pattern Recognition
Teaches computers to see new things from descriptions.
Synthetic Captions for Open-Vocabulary Zero-Shot Segmentation
CV and Pattern Recognition
Lets computers understand pictures better.
SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning
CV and Pattern Recognition
Draw a box, get many picture descriptions.