FRAME: Pre-Training Video Feature Representations via Anticipation and Memory
By: Sethuraman TV, Savya Khosla, Vignesh Srinivasakumar, and more
Potential Business Impact:
Helps computers track and label objects in videos, frame by frame.
Dense video prediction tasks, such as object tracking and semantic segmentation, require video encoders that generate temporally consistent, spatially dense features for every frame. However, existing approaches fall short: image encoders like DINO or CLIP lack temporal awareness, while video models such as VideoMAE underperform compared to image encoders on dense prediction tasks. We address this gap with FRAME, a self-supervised video frame encoder tailored for dense video understanding. FRAME learns to predict current and future DINO patch features from past and present RGB frames, leading to spatially precise and temporally coherent representations. To our knowledge, FRAME is the first video encoder to leverage image-based models for dense prediction while outperforming them on tasks requiring fine-grained visual correspondence. As an auxiliary capability, FRAME aligns its class token with CLIP's semantic space, supporting language-driven tasks such as video classification. We evaluate FRAME across six dense prediction tasks on seven datasets, where it consistently outperforms image encoders and existing self-supervised video models. Despite its versatility, FRAME maintains a compact architecture suitable for a range of downstream applications.
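The core objective described above, regressing current and future DINO patch features from past and present frames, can be illustrated with a toy sketch. This is not the authors' code: the encoder, prediction heads, and shapes below are all simplified stand-ins, assuming an MSE regression loss against frozen teacher features.

```python
import numpy as np

rng = np.random.default_rng(0)

T, P, D = 4, 16, 32  # past/present frames, patches per frame, teacher feature dim

def toy_encoder(frames, w):
    """Stand-in video encoder: mean-pool patches over time, then a
    linear head that maps into the (hypothetical) DINO feature space."""
    pooled = frames.mean(axis=0)   # (P, D_in): temporal pooling over frames
    return pooled @ w              # (P, D): predicted patch features

def anticipation_loss(pred_now, pred_future, dino_now, dino_future):
    """MSE between predicted and target DINO patch features for the
    current frame (t) and a future frame (t+k)."""
    return np.mean((pred_now - dino_now) ** 2) + \
           np.mean((pred_future - dino_future) ** 2)

frames   = rng.normal(size=(T, P, 8))     # toy RGB patch inputs
w_now    = rng.normal(size=(8, D)) * 0.1  # head predicting features at t
w_fut    = rng.normal(size=(8, D)) * 0.1  # head anticipating features at t+k
dino_now = rng.normal(size=(P, D))        # frozen teacher features at t
dino_fut = rng.normal(size=(P, D))        # frozen teacher features at t+k

loss = anticipation_loss(toy_encoder(frames, w_now),
                         toy_encoder(frames, w_fut),
                         dino_now, dino_fut)
print(float(loss))
```

Minimizing such a loss over many clips pushes the encoder toward representations that are both spatially aligned with the image teacher and predictive of near-future content, which is the intuition behind the paper's "anticipation" pre-training.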
Similar Papers
Enhancing Self-Supervised Fine-Grained Video Object Tracking with Dynamic Memory Prediction
CV and Pattern Recognition
Improves video tracking by using more past pictures.
Benefits of Feature Extraction and Temporal Sequence Analysis for Video Frame Prediction: An Evaluation of Hybrid Deep Learning Models
CV and Pattern Recognition
Predicts future video frames more accurately.
Video Self-Distillation for Single-Image Encoders: A Step Toward Physically Plausible Perception
CV and Pattern Recognition
Teaches computers to understand videos by predicting what happens next.