Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision
By: Chenshuang Zhang, Kang Zhang, Joon Son Chung and more
Potential Business Impact:
Helps computers track moving, identical things.
Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Although supervised trackers show promise, contemporary self-supervised trackers struggle when visual cues become ambiguous, limiting their scalability and generalization without extensive labeled data. We find that pre-trained video diffusion models inherently learn motion representations suitable for tracking without task-specific training. This ability arises because their denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Capitalizing on this discovery, our self-supervised tracker significantly improves performance in distinguishing visually similar objects, an underexplored failure point for existing methods. Our method achieves up to a 6-point improvement over recent self-supervised approaches on established benchmarks and our newly introduced tests focused on tracking visually similar items. Visualizations confirm that these diffusion-derived motion representations enable robust tracking of even identical objects across challenging viewpoint changes and deformations.
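The abstract does not say how the diffusion-derived features are turned into tracks. As an illustrative sketch only, and not the authors' implementation, the snippet below shows label propagation by feature affinity, a common way to evaluate self-supervised trackers. It assumes per-location feature vectors have already been extracted from a pre-trained video diffusion model at a high-noise denoising step; that extraction is model-specific and is not shown. The function names and the `topk` and `temperature` parameters are hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small epsilon avoids division by zero


def propagate_labels(feat_ref, feat_tgt, labels_ref, topk=5, temperature=0.1):
    """Copy labels from a reference frame to a target frame by feature affinity.

    feat_ref:   N feature vectors from the reference frame (assumed to come
                from a video diffusion model at a high-noise step)
    feat_tgt:   M feature vectors from the target frame
    labels_ref: N label vectors (one-hot or soft), one per reference location
    returns:    M soft label vectors, one per target location
    """
    num_classes = len(labels_ref[0])
    propagated = []
    for query in feat_tgt:
        sims = [cosine(query, ref) for ref in feat_ref]
        # Keep only the top-k most similar reference locations.
        best = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:topk]
        # Softmax over the kept similarities (shifted by the max for stability).
        peak = max(sims[i] for i in best)
        weights = [math.exp((sims[i] - peak) / temperature) for i in best]
        total = sum(weights)
        propagated.append([
            sum(w * labels_ref[i][c] for w, i in zip(weights, best)) / total
            for c in range(num_classes)
        ])
    return propagated


# Toy example: two reference locations belonging to two different objects.
feat_ref = [[1.0, 0.0], [0.0, 1.0]]
labels_ref = [[1.0, 0.0], [0.0, 1.0]]
feat_tgt = [[0.9, 0.1], [0.2, 0.8]]
labels_tgt = propagate_labels(feat_ref, feat_tgt, labels_ref, topk=1)
```

In this sketch the tracker can separate two look-alike objects only if the feature vectors themselves differ, which is where the abstract's claim matters: features taken from early, high-noise denoising stages are reported to encode motion rather than appearance.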
Similar Papers
Point Prompting: Counterfactual Tracking with Video Diffusion Models
CV and Pattern Recognition
Tracks moving dots in videos without training.
Fitting Image Diffusion Models on Video Datasets
CV and Pattern Recognition
Makes AI create videos that look more real.
Back to Basics: Motion Representation Matters for Human Motion Generation Using Diffusion Model
CV and Pattern Recognition
Makes computer-generated dancing look more real.