
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models

Published: April 4, 2025 | arXiv ID: 2504.03970v2

By: Dahun Kim, AJ Piergiovanni, Ganesh Mallya and more

Potential Business Impact:

Helps computers understand video stories better.

Business Areas:

Video Editing, Content and Publishing, Media and Entertainment, Video

We introduce VideoComp, a benchmark and learning framework for advancing video-text compositionality understanding, aimed at improving vision-language models (VLMs) in fine-grained temporal alignment. Unlike existing benchmarks focused on static image-text compositionality or isolated single-event videos, our benchmark targets alignment in continuous multi-event videos. Leveraging video-text datasets with temporally localized event captions (e.g. ActivityNet-Captions, YouCook2), we construct two compositional benchmarks, ActivityNet-Comp and YouCook2-Comp. We create challenging negative samples with subtle temporal disruptions such as reordering, action word replacement, partial captioning, and combined disruptions. These benchmarks comprehensively test models' compositional sensitivity across extended, cohesive video-text sequences. To improve model performance, we propose a hierarchical pairwise preference loss that strengthens alignment with temporally accurate pairs and gradually penalizes increasingly disrupted ones, encouraging fine-grained compositional learning. To mitigate the limited availability of densely annotated video data, we introduce a pretraining strategy that concatenates short video-caption pairs to simulate multi-event sequences. We evaluate video-text foundational models and large multimodal models (LMMs) on our benchmark, identifying both strengths and areas for improvement in compositionality. Overall, our work provides a comprehensive framework for evaluating and enhancing model capabilities in achieving fine-grained, temporally coherent video-text alignment.
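As a rough illustration of the ideas in the abstract, the sketch below builds two of the named disruptions (reordering and partial captioning) from a list of per-event captions and scores a ranked list of candidates with a hinge-style ranking loss. This is not the authors' implementation: the function names, the `margin` parameter, and the hinge form of the loss are all assumptions made for illustration, since the abstract does not specify the exact formulation.

```python
import random
from typing import List, Sequence


def reorder_events(captions: Sequence[str], rng: random.Random) -> List[str]:
    """Return the event captions in a different temporal order (hypothetical helper)."""
    if len(set(captions)) < 2:
        raise ValueError("need at least two distinct captions to reorder")
    shuffled = list(captions)
    # Reshuffle until the order actually differs from the original.
    while shuffled == list(captions):
        rng.shuffle(shuffled)
    return shuffled


def partial_caption(captions: Sequence[str], keep: int) -> List[str]:
    """Keep only the first `keep` events, so the text covers part of the video."""
    if not 0 < keep < len(captions):
        raise ValueError("keep must retain at least one event and drop at least one")
    return list(captions[:keep])


def hierarchical_preference_loss(scores: Sequence[float], margin: float = 0.1) -> float:
    """Hinge-style ranking loss (an assumed form, not taken from the paper).

    scores[0] is the video-text similarity of the accurate caption; each later
    entry belongs to a more heavily disrupted caption. Every adjacent pair is
    penalized unless the less disrupted caption wins by at least `margin`.
    """
    return sum(
        max(0.0, margin - (better - worse))
        for better, worse in zip(scores, scores[1:])
    )
```

For example, a model that scores the accurate caption at 0.9, a lightly disrupted one at 0.6, and a heavily disrupted one at 0.2 incurs no loss under this sketch, whereas a model that scores a disrupted caption above the accurate one is penalized.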

CompAlign: Improving Compositional Text-to-Image Generation with a Complex Benchmark and Fine-Grained Feedback

CV and Pattern Recognition

Makes AI draw pictures with many things correctly.

16 May 2025 1

89%

MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation

CV and Pattern Recognition

Creates videos from text, following instructions better.

18 Mar 2025 1

88%

VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models

CV and Pattern Recognition

Helps computers understand fast actions in videos.

24 Nov 2025 1


Repos / Data Links

github.com

Page Count

11 pages
