Object-Centric Framework for Video Moment Retrieval

Published: December 20, 2025 | arXiv ID: 2512.18448v1

By: Zongyao Li, Yongkang Wong, Satoshi Yamazaki and more

Most existing video moment retrieval methods rely on temporal sequences of frame- or clip-level features that primarily encode global visual and semantic information. However, such representations often fail to capture fine-grained object semantics and appearance, which are crucial for localizing moments described by object-oriented queries involving specific entities and their interactions. In particular, temporal dynamics at the object level have been largely overlooked, limiting the effectiveness of existing approaches in scenarios requiring detailed object-level reasoning. To address this limitation, we propose a novel object-centric framework for moment retrieval. Our method first extracts query-relevant objects using a scene graph parser and then generates scene graphs from video frames to represent these objects and their relationships. Based on the scene graphs, we construct object-level feature sequences that encode rich visual and semantic information. These sequences are processed by a relational tracklet transformer, which models spatio-temporal correlations among objects over time. By explicitly capturing object-level state changes, our framework enables more accurate localization of moments aligned with object-oriented queries. We evaluated our method on three benchmarks: Charades-STA, QVHighlights, and TACoS. Experimental results demonstrate that our method outperforms existing state-of-the-art methods across all benchmarks.
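
To make the idea of object-level feature sequences concrete, here is a minimal sketch, not the authors' implementation. It assumes a hypothetical input format in which each frame already comes with a list of (label, feature) detections, and it matches objects to the query by exact label string, keeping one instance per label per frame. The function name `build_tracklets`, the zero-padding for missing frames, and the presence mask are all illustrative assumptions; the scene graph parser and the relational tracklet transformer described in the abstract are not modeled here.

```python
def build_tracklets(frame_detections, query_objects, feat_dim):
    """Group per-frame detections into one feature sequence per query object.

    frame_detections: list over frames; each entry is a list of
        (label, feature) pairs, where feature is a list of floats
        (hypothetical format, assumed for this sketch).
    query_objects: labels of the objects mentioned in the query.
    feat_dim: length of each feature vector.

    Returns {label: (sequence, mask)}, where sequence has one feature
    vector per frame (zeros when the object is absent) and mask marks
    the frames in which the object was detected.
    """
    query = set(query_objects)
    num_frames = len(frame_detections)

    # Start every query object with an all-zero sequence and an empty mask.
    tracklets = {
        label: ([[0.0] * feat_dim for _ in range(num_frames)],
                [False] * num_frames)
        for label in sorted(query)
    }

    for t, detections in enumerate(frame_detections):
        for label, feature in detections:
            # Skip objects the query does not mention; keep only the
            # first instance of each label in a frame (a simplification).
            if label in query and not tracklets[label][1][t]:
                tracklets[label][0][t] = list(feature)
                tracklets[label][1][t] = True

    return tracklets


if __name__ == "__main__":
    frames = [
        [("person", [1.0, 2.0]), ("dog", [9.0, 9.0])],
        [],
        [("cup", [3.0, 4.0]), ("person", [5.0, 6.0])],
    ]
    for label, (seq, mask) in build_tracklets(frames, ["person", "cup"], 2).items():
        print(label, seq, mask)
```

Sequences of this shape, together with their masks, are the kind of input a sequence model could consume to reason about how each object changes over time; how the paper actually encodes and tracks objects may differ.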

Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion

CV and Pattern Recognition

Finds specific video moments using smart searching.

15 Dec 2025 0

89%

Enhanced Multimodal Video Retrieval System: Integrating Query Expansion and Cross-modal Temporal Event Retrieval

Information Retrieval

Finds video clips using many search words.

6 Dec 2025 0

89%

Aligning Moments in Time using Video Queries

CV and Pattern Recognition

Finds matching moments in one video using another.

21 Aug 2025 4
