Tracking by Predicting 3-D Gaussians Over Time
By: Tanish Baranwal, Himanshu Gaurav Singh, Jathushan Rajasegaran, and more
Potential Business Impact:
Automatically tracks moving objects in videos.
We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach to representation learning that encodes a sequence of images as a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture. Mapping the trajectories of the learnt Gaussians onto the image plane yields zero-shot tracking performance comparable to the state of the art. With small-scale finetuning, our models achieve a 34.6% improvement on the Kinetics dataset and 13.1% on Kubric, surpassing existing self-supervised video approaches. The project page and code are publicly available at https://videogmae.org/ and https://github.com/tekotan/video-gmae.
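The zero-shot tracking step described above — mapping the learnt Gaussians' trajectories onto the image plane — can be sketched as a simple pinhole projection of per-frame 3-D Gaussian centers. This is an illustrative assumption, not the paper's implementation: the function name, the fixed intrinsics matrix `K`, and the toy data are all hypothetical.

```python
import numpy as np

def project_gaussian_tracks(means_3d, K):
    """Project per-frame 3-D Gaussian centers onto the image plane.

    means_3d: (T, N, 3) array of N Gaussian centers over T frames.
    K: (3, 3) camera intrinsics (assumed fixed across frames; the
       paper's actual camera model is not specified here).
    Returns a (T, N, 2) array of pixel trajectories, i.e. one 2-D
    track per Gaussian.
    """
    homog = means_3d @ K.T                      # pinhole projection
    return homog[..., :2] / homog[..., 2:3]    # divide by depth

# Toy example: two Gaussians drifting along x over three frames.
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0, 0.0, 1.0]])
means = np.stack([
    np.array([[0.0, 0.0, 2.0],
              [0.5, 0.0, 2.0]]) + np.array([0.1 * t, 0.0, 0.0])
    for t in range(3)
])
tracks = project_gaussian_tracks(means, K)
print(tracks.shape)  # (3, 2, 2): 3 frames, 2 tracks, (u, v) pixels
```

Reading each Gaussian's row across the time axis gives its 2-D track; in the paper's setting these tracks are what get compared against ground-truth point trajectories.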
Similar Papers
Gaussian Masked Autoencoders
CV and Pattern Recognition
Teaches computers to understand pictures' depth and layers.
Recurrent Video Masked Autoencoders
CV and Pattern Recognition
Teaches computers to understand videos better.
Towards Efficient General Feature Prediction in Masked Skeleton Modeling
CV and Pattern Recognition
Teaches computers to understand human actions better.