Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click
By: Raphael Ruschel, Hardikkumar Prajapati, Awsafur Rahman, and more
Potential Business Impact:
Lets a user click once on a video and have the system track that subject, find what it interacts with, and describe what is happening.
State-of-the-art Video Scene Graph Generation (VSGG) systems provide structured visual understanding but operate as closed, feed-forward pipelines with no ability to incorporate human guidance. In contrast, promptable segmentation models such as SAM2 enable precise user interaction but lack semantic or relational reasoning. We introduce Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG) that unifies visual prompting with spatial, temporal, and semantic understanding. From a single user cue, such as a click or bounding box, Click2Graph segments and tracks the subject across time, autonomously discovers interacting objects, and predicts <subject, object, predicate> triplets to form a temporally consistent scene graph. Our framework introduces two key components: a Dynamic Interaction Discovery Module that generates subject-conditioned object prompts, and a Semantic Classification Head that performs joint entity and predicate reasoning. Experiments on the OpenPVSG benchmark demonstrate that Click2Graph establishes a strong foundation for user-guided PVSG, showing how human prompting can be combined with panoptic grounding and relational inference to enable controllable and interpretable video scene understanding.
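To make the pipeline concrete, the sketch below traces the control flow the abstract describes: a single click seeds subject segmentation and tracking, the Dynamic Interaction Discovery Module proposes interacting objects, and the Semantic Classification Head labels each subject-object pair as a triplet. All function names, signatures, and the stub models here are illustrative assumptions, not the authors' code or the SAM2 API.

```python
# Illustrative sketch of the Click2Graph pipeline; every name and stub
# below is a hypothetical stand-in, not the paper's implementation.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Triplet:
    subject: str
    obj: str
    predicate: str


def segment_and_track(frames: List[int], click_xy) -> List[str]:
    # Stub: a SAM2-style promptable segmenter would return one subject
    # mask per frame; here we just fabricate a mask id per frame.
    return [f"subject_mask_f{t}" for t in range(len(frames))]


def discover_interactions(frames: List[int],
                          subject_masks: List[str]) -> Dict[str, List[str]]:
    # Stub for the Dynamic Interaction Discovery Module: it would emit
    # subject-conditioned prompts and segment each interacting object.
    return {"object_0": [f"object0_mask_f{t}" for t in range(len(frames))]}


def classify_pair(subject_masks: List[str], object_masks: List[str]):
    # Stub for the Semantic Classification Head: joint entity and
    # predicate prediction for one subject-object track pair.
    return ("person", "cup", "holding")


def click2graph(frames: List[int], click_xy) -> List[Triplet]:
    """From a single user click, build a temporally consistent triplet set."""
    subject_masks = segment_and_track(frames, click_xy)     # 1. segment + track subject
    objects = discover_interactions(frames, subject_masks)  # 2. discover interacting objects
    graph: List[Triplet] = []
    for obj_masks in objects.values():                      # 3. classify each pair
        s, o, p = classify_pair(subject_masks, obj_masks)
        graph.append(Triplet(s, o, p))
    return graph


print(click2graph(frames=[0, 1, 2], click_xy=(120, 85)))
```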
Similar Papers
View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs
CV and Pattern Recognition
Helps robots find objects in 3D scenes from plain-language descriptions.
SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis
CV and Pattern Recognition
Makes surgery training videos more realistic and controllable.
VOST-SGG: VLM-Aided One-Stage Spatio-Temporal Scene Graph Generation
CV and Pattern Recognition
Helps computers understand what is happening in videos and why.