Video Self-Distillation for Single-Image Encoders: A Step Toward Physically Plausible Perception
By: Marcel Simon, Tae-Ho Kim, Seul-Ki Yeom
Potential Business Impact:
Teaches computers to understand videos by predicting what happens next.
Self-supervised image encoders such as DINO have recently attracted significant interest for learning robust visual features without labels. However, most self-supervised learning (SSL) methods train on static images and miss the temporal cues inherent in videos. We introduce a video-distilled single-image encoder trained to predict the next-frame representation from the current frame. This simple objective injects 3D spatial and temporal priors without requiring optical flow or tracking. When pre-trained on a single 2-hour video, our approach raises the mean Intersection-over-Union (mIoU) on ADE20K from 35.0 (DoRA) to 36.4 while remaining a drop-in replacement for image-only pipelines. Our results highlight video self-distillation as a lightweight route to geometry-aware perception, an essential ingredient for physically plausible world models and Physical AI.
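To make the objective concrete, below is a minimal sketch of next-frame self-distillation under assumptions in the spirit of DINO-style training: a student encoder is trained by backpropagation while an exponential-moving-average (EMA) teacher embeds the next frame as the prediction target. The class and method names (VideoSelfDistill, ema_update) and the normalized-MSE loss are illustrative choices, not the authors' exact implementation.

```python
# Illustrative sketch of the next-frame self-distillation objective.
# Assumptions: EMA teacher (DINO-style) and a cosine/normalized-MSE loss;
# the paper's actual heads and loss may differ.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoSelfDistill(nn.Module):
    def __init__(self, encoder: nn.Module, ema_momentum: float = 0.996):
        super().__init__()
        self.student = encoder                 # trained by backprop
        self.teacher = copy.deepcopy(encoder)  # EMA copy, no gradients
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.m = ema_momentum

    @torch.no_grad()
    def ema_update(self):
        # Teacher weights track an exponential moving average of the student.
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.mul_(self.m).add_(ps, alpha=1.0 - self.m)

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor) -> torch.Tensor:
        # Student embeds the current frame; teacher embeds the next frame.
        z_pred = self.student(frame_t)
        with torch.no_grad():
            z_target = self.teacher(frame_t1)
        # Predict the next-frame representation: 2 - 2*cos(z_pred, z_target).
        z_pred = F.normalize(z_pred, dim=-1)
        z_target = F.normalize(z_target, dim=-1)
        return (2.0 - 2.0 * (z_pred * z_target).sum(dim=-1)).mean()
```

In use, one would sample consecutive frame pairs from the video, compute loss = model(frame_t, frame_t1), backpropagate, step the optimizer, and then call model.ema_update(). At inference only the student remains, which is what makes the encoder a drop-in replacement for image-only pipelines.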
Similar Papers
Object-level Self-Distillation for Vision Pretraining
CV and Pattern Recognition
Teaches computers to see objects, not just pictures.
Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation
CV and Pattern Recognition
Helps doctors see heart problems in ultrasound.
FRAME: Pre-Training Video Feature Representations via Anticipation and Memory
CV and Pattern Recognition
Helps computers understand videos better.