Seeing without Pixels: Perception from Camera Trajectories
By: Zihui Xue, Kristen Grauman, Dima Damen, and more
Potential Business Impact:
Camera movement alone can reveal what's happening in a video.
Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. To this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal for uncovering video content. In other words, "how you move" can indeed reveal "what you are doing" (egocentric) or "what you are observing" (exocentric). We demonstrate the versatility of the learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, from high-fidelity multi-sensor rigs to standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
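To make the setup concrete, here is a minimal sketch of the idea described in the abstract: a Transformer encoder over a sequence of camera poses, trained with a CLIP-style symmetric contrastive (InfoNCE) loss against text embeddings. The class and function names, the pose parameterization (3D translation plus quaternion), and all dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a CamFormer-style trajectory encoder.
# Assumptions (not from the paper): pose_dim=7 (translation + quaternion),
# mean pooling over time, and a frozen/precomputed text embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CamFormer(nn.Module):
    def __init__(self, pose_dim=7, d_model=256, n_layers=4, n_heads=8, embed_dim=512):
        super().__init__()
        self.input_proj = nn.Linear(pose_dim, d_model)   # lift each timestep's pose
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, embed_dim)    # map into the joint text space

    def forward(self, poses):                            # poses: (B, T, pose_dim)
        h = self.encoder(self.input_proj(poses))         # (B, T, d_model)
        h = h.mean(dim=1)                                # temporal pooling
        return F.normalize(self.out_proj(h), dim=-1)     # unit-norm trajectory embedding

def contrastive_loss(traj_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched trajectory/caption pairs lie on the diagonal."""
    logits = traj_emb @ text_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with random placeholder data:
model = CamFormer()
traj = model(torch.randn(8, 128, 7))                     # 8 trajectories, 128 timesteps
text = F.normalize(torch.randn(8, 512), dim=-1)          # stand-in text embeddings
loss = contrastive_loss(traj, text)
```

Aligning trajectories with language in a shared space is what would let a single encoder serve the downstream tasks mentioned above (retrieval-style cross-modal alignment, zero-shot classification via caption similarity) without task-specific heads.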
Similar Papers
VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos
CV and Pattern Recognition
Makes videos with amazing camera moves.
From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance
CV and Pattern Recognition
Finds your location on a map from a photo.
Unified Camera Positional Encoding for Controlled Video Generation
CV and Pattern Recognition
Makes videos follow camera movements perfectly.