Learning Streaming Video Representation via Multitask Training
By: Yibin Yan, Jilan Xu, Shangzhe Di, and more
Potential Business Impact:
Helps robots and self-driving cars understand live video in real time.
Understanding continuous video streams plays a fundamental role in real-time applications such as embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. (i) We develop a novel streaming video backbone, termed StreamFormer, by incorporating causal temporal attention into a pre-trained vision transformer. This enables efficient streaming video processing while maintaining image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatiotemporal video understanding tasks within a multitask visual-language alignment framework, so that StreamFormer learns global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real-time applications.
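To make the backbone idea concrete, here is a minimal PyTorch sketch of causal temporal attention over per-frame features. The class name, dimensions, and residual placement are illustrative assumptions, not the actual StreamFormer implementation; the abstract only specifies that causal temporal attention is added to a pre-trained vision transformer.

```python
# A minimal sketch of causal temporal attention over per-frame ViT tokens.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class CausalTemporalAttention(nn.Module):
    """Attends each frame's feature to the current and past frames only."""

    def __init__(self, embed_dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, embed_dim) — e.g. one pooled token per frame
        # taken from a frozen or fine-tuned pre-trained ViT.
        t = x.size(1)
        # Upper-triangular -inf mask blocks attention to future frames,
        # so the representation at time step t uses frames 1..t only.
        causal_mask = torch.triu(
            torch.full((t, t), float("-inf"), device=x.device), diagonal=1
        )
        out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        return x + out  # residual connection, as in a standard transformer block
```

In a genuinely streaming deployment, the mask would typically be replaced by cached key/value states so that past frames are not recomputed at every step; the mask form above is the training-time equivalent.

The multitask objective is described as visual-language alignment; one common instantiation is a symmetric contrastive (InfoNCE-style) loss, sketched below under the assumption of paired video and text embeddings. The function name, shapes, and temperature are hypothetical.

```python
# A minimal sketch of visual-language alignment via a symmetric contrastive
# loss — one plausible reading of the "multitask visual-language alignment
# framework"; the paper's exact objective may differ.
import torch
import torch.nn.functional as F

def alignment_loss(video_emb: torch.Tensor,
                   text_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    # video_emb, text_emb: (batch, dim), paired row-wise.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: match each video to its text and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

Under a multitask setup, per-task heads (e.g., per-frame embeddings for dense tasks like segmentation, pooled embeddings for clip-level tasks like QA) could each be aligned against task-specific text in this same form, though how the paper weights and combines the tasks is not stated in the abstract.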
Similar Papers
StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding
CV and Pattern Recognition
Helps self-driving cars anticipate future events.
StreamingVLM: Real-Time Understanding for Infinite Video Streams
CV and Pattern Recognition
Lets computers watch long videos in real time.
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
CV and Pattern Recognition
Teaches computers to understand what you're looking at.