Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
By: Yuanli Wu, Long Zhang, Yue Du and more
Potential Business Impact:
Makes long videos shorter without losing important parts.
With video exploding across social media, surveillance, and education, compressing long footage into concise yet faithful surrogates is crucial. Supervised methods learn frame/shot importance from dense labels and excel in-domain, but are costly and brittle across datasets; unsupervised methods avoid labels but often miss high-level semantics and narrative cues. Recent zero-shot pipelines use LLMs for training-free summarization, yet remain sensitive to handcrafted prompts and dataset-specific normalization.

We propose a rubric-guided, pseudo-labeled prompting framework. A small subset of human annotations is converted into high-confidence pseudo labels and aggregated into structured, dataset-adaptive scoring rubrics for interpretable scene evaluation. At inference, boundary scenes (first/last) are scored from their own descriptions, while intermediate scenes include brief summaries of adjacent segments to assess progression and redundancy, enabling the LLM to balance local salience with global coherence without parameter tuning.

Across three benchmarks, our method is consistently effective. On SumMe and TVSum it achieves F1 of 57.58 and 63.05, surpassing a zero-shot baseline (56.73, 62.21) by +0.85 and +0.84 and approaching supervised performance. On the query-focused QFVS benchmark it attains 53.79 F1, beating 53.42 by +0.37 and remaining stable across validation videos. These results show that rubric-guided pseudo labeling, coupled with contextual prompting, stabilizes LLM-based scoring and yields a general, interpretable zero-shot paradigm for both generic and query-focused video summarization.
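The context-aware prompting step in the abstract (boundary scenes scored from their own descriptions, intermediate scenes also given brief summaries of their neighbors) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' implementation: the function name `build_scene_prompt`, the prompt wording, and the idea that each scene already has a precomputed description and short summary are all hypothetical. The LLM call itself is left out.

```python
from typing import List


def build_scene_prompt(rubric: str, descriptions: List[str],
                       summaries: List[str], index: int) -> str:
    """Assemble a scoring prompt for one scene (hypothetical prompt layout).

    rubric:       text of a scoring rubric (assumed to be built beforehand)
    descriptions: one full description per scene
    summaries:    one brief summary per scene, used only as neighbor context
    index:        position of the scene to score
    """
    if len(descriptions) != len(summaries):
        raise ValueError("descriptions and summaries must have equal length")
    if not 0 <= index < len(descriptions):
        raise IndexError("scene index out of range")

    parts = [
        f"Scoring rubric:\n{rubric}",
        f"Scene description:\n{descriptions[index]}",
    ]

    # Boundary scenes (first/last) are scored from their own description only.
    is_boundary = index == 0 or index == len(descriptions) - 1
    if not is_boundary:
        # Intermediate scenes also see brief summaries of adjacent segments,
        # so progression and redundancy can be judged.
        parts.append(f"Previous scene summary:\n{summaries[index - 1]}")
        parts.append(f"Next scene summary:\n{summaries[index + 1]}")

    parts.append("Return one importance score according to the rubric.")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    demo_descriptions = ["A chef enters a kitchen.",
                         "The chef chops vegetables.",
                         "The finished dish is served."]
    demo_summaries = ["Chef arrives.", "Chopping.", "Dish served."]
    print(build_scene_prompt("Rate importance from 1 to 10.",
                             demo_descriptions, demo_summaries, 1))
```

Each returned string would then be sent to an LLM of choice, one call per scene; how scores are turned into a final summary is outside what the abstract specifies.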
Similar Papers
Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
CV and Pattern Recognition
Makes long videos short without watching them.
Prompts to Summaries: Zero-Shot Language-Guided Video Summarization
CV and Pattern Recognition
Makes videos shorter by asking questions.
Xiaoice: Training-Free Video Understanding via Self-Supervised Spatio-Temporal Clustering of Semantic Features
CV and Pattern Recognition
Makes computers understand videos without extra training.