Learning Multi-frame and Monocular Prior for Estimating Geometry in Dynamic Scenes
By: Seong Hyeon Park, Jinwoo Shin
Potential Business Impact:
Makes videos show 3D shapes of moving things.
Estimating the 3D geometry of content in monocular videos of dynamic scenes has been a fundamental challenge in computer vision. The task is particularly difficult under object motion, where existing models are limited to predicting only partial attributes of the dynamic scene, such as depth or pointmaps spanning only a pair of frames. Since these attributes are inherently noisy across multiple frames, test-time global optimization is often employed to fully recover the geometry, which is prone to failure and incurs heavy inference costs. To address this challenge, we present a new model, coined MMP, that estimates the geometry in a feed-forward manner, producing a dynamic pointmap representation that evolves over multiple frames. Specifically, building on the recent Siamese architecture, we introduce a new trajectory encoding module that projects point-wise dynamics onto the representation for each frame, providing significantly improved expressiveness for dynamic scenes. In our experiments, MMP achieves state-of-the-art quality in feed-forward pointmap prediction, e.g., a 15.1% reduction in regression error.
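To make the pointmap idea concrete, the sketch below illustrates a dynamic pointmap representation: each frame stores a per-pixel 3D point, and a simple point-wise dynamics encoding is derived as the displacement between consecutive frames. This is a minimal toy illustration, not the paper's method; the function name `encode_trajectories` and the displacement-based encoding are assumptions standing in for MMP's learned trajectory encoding module.

```python
import numpy as np

def encode_trajectories(pointmaps):
    """Given per-frame pointmaps of shape (T, H, W, 3), return a
    point-wise dynamics encoding of the same shape: the per-pixel
    3D displacement from the previous frame (zero for frame 0).
    Hypothetical stand-in for a learned trajectory encoding."""
    deltas = np.diff(pointmaps, axis=0)              # (T-1, H, W, 3)
    zeros = np.zeros_like(pointmaps[:1])             # frame 0 has no motion
    return np.concatenate([zeros, deltas], axis=0)   # (T, H, W, 3)

# Toy example: 4 frames of a 2x2 pointmap translating 0.5 units along x per frame.
T, H, W = 4, 2, 2
xs, ys = np.meshgrid(np.arange(W), np.arange(H), indexing="xy")
base = np.stack([xs, ys, np.zeros((H, W))], axis=-1).astype(float)  # (H, W, 3)
pointmaps = np.stack([base + np.array([t * 0.5, 0.0, 0.0]) for t in range(T)])

enc = encode_trajectories(pointmaps)
# enc[0] is all zeros; every later frame encodes a uniform +0.5 shift in x.
```

Concatenating such an encoding with each frame's pointmap gives one evolving representation per frame, which is the kind of multi-frame structure the feed-forward model predicts directly, avoiding test-time global optimization.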
Similar Papers
The Dynamic Prior: Understanding 3D Structures for Casual Dynamic Videos
CV and Pattern Recognition
Helps cameras understand moving things in videos better.
Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction
CV and Pattern Recognition
Tracks moving things in 3D video.
MonoMobility: Zero-Shot 3D Mobility Analysis from Monocular Videos
CV and Pattern Recognition
Lets robots understand how things move from one video.