OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment
By: Qi Liu, Weiying Xue, Yuxiao Wang, and more
Potential Business Impact:
Helps computers understand what's happening in videos.
The video visual relation detection (VidVRD) task identifies objects and their relationships in videos, which is challenging due to dynamic content, high annotation costs, and the long-tailed distribution of relations. Vision-language models (VLMs) help explore open-vocabulary visual relation detection, yet they often overlook the connections between visual regions and their relations. Moreover, using VLMs to directly identify visual relations in videos is difficult because of the large disparity between images and videos. We therefore propose a novel open-vocabulary VidVRD framework, termed OpenVidVRD, which transfers VLMs' rich knowledge and strong capabilities to VidVRD through prompt learning. Specifically, we use a VLM to extract text representations from automatically generated captions of the video's regions. Next, a spatiotemporal refiner module derives object-level relationship representations by integrating cross-modal spatiotemporal complementary information. Furthermore, a prompt-driven semantic space alignment strategy harnesses the semantic understanding of VLMs, enhancing the overall generalization ability of OpenVidVRD. Extensive experiments on the public VidVRD and VidOR datasets show that the proposed model outperforms existing methods.
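The prompt-driven alignment the abstract describes can be illustrated with a minimal CLIP-style sketch: a region-pair relation embedding is scored against text embeddings of relation prompts by cosine similarity, so any relation label with a text embedding can be recognized, even one unseen at training time. All names and the random stand-in embeddings below are hypothetical; in the actual framework these vectors would come from the VLM's visual and text encoders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical stand-ins: in OpenVidVRD these embeddings would come from
# the VLM's encoders; here random vectors illustrate the mechanism only.
rng = np.random.default_rng(0)
dim = 16
region_feat = rng.normal(size=dim)           # object-pair relation representation
relation_vocab = ["ride", "chase", "watch"]  # open vocabulary of relation labels
# Text embeddings of prompts such as "a video of a person <rel> a bike"
text_feats = rng.normal(size=(len(relation_vocab), dim))

# L2-normalize both sides, then score by cosine similarity and softmax,
# the usual CLIP-style open-vocabulary classification rule.
region_feat = region_feat / np.linalg.norm(region_feat)
text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
scores = softmax(text_feats @ region_feat)

pred = relation_vocab[int(np.argmax(scores))]
print(pred, scores.round(3))
```

Extending the vocabulary is then just a matter of encoding a new relation prompt and appending its embedding to `text_feats`; no retraining of the visual side is required.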
Similar Papers
Generalized Visual Relation Detection with Diffusion Models
CV and Pattern Recognition
Helps computers see relationships beyond labels.
Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection
CV and Pattern Recognition
Teaches computers to see new object actions.
Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking
CV and Pattern Recognition
Lets computers see and understand anything in videos.