Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
By: Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, and more
Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to solve this task efficiently. D4RT uses a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps both the heavy computation of dense, per-frame decoding and the complexity of managing multiple task-specific decoders. The decoding interface lets the model independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight, highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. See the project webpage for animated results: https://d4rt-paper.github.io/.
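To make the querying idea concrete, below is a minimal, hypothetical sketch of what such a point-query decoding interface could look like: a video encoder produces tokens once, and a lightweight decoder cross-attends to them to answer sparse (u, v, t) queries with 3D positions, rather than decoding every frame densely. This is written in PyTorch-style Python; the class names, tensor shapes, and the (u, v, t) query parameterization are illustrative assumptions, not the paper's actual interface.

```python
import torch
import torch.nn as nn

class QueryDecoder(nn.Module):
    """Hypothetical point-query decoder: maps (u, v, t) probes to 3D points
    by cross-attending to precomputed video tokens. Names/shapes are assumed."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.query_embed = nn.Linear(3, dim)   # embed a normalized (u, v, t) query
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 3)          # regress a 3D position (X, Y, Z)

    def forward(self, video_tokens, queries):
        # video_tokens: (B, N, dim) tokens from a shared transformer video encoder
        # queries:      (B, Q, 3) normalized space-time coordinates (u, v, t)
        q = self.query_embed(queries)
        out, _ = self.attn(q, video_tokens, video_tokens)  # attend over the video
        return self.head(out)                              # (B, Q, 3) 3D points

# Queries are processed independently, so any sparse set of space-time points
# can be probed without producing dense per-frame outputs.
video_tokens = torch.randn(1, 1024, 256)  # stand-in for encoder output
queries = torch.rand(1, 16, 3)            # 16 random space-time probes
points = QueryDecoder()(video_tokens, queries)
print(points.shape)                       # torch.Size([1, 16, 3])
```

The appeal of this design, as the abstract describes, is that decoding cost scales with the number of queries rather than with video resolution and length, and a single decoding head can serve multiple tasks instead of maintaining separate task-specific decoders.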
Similar Papers
Flux4D: Flow-based Unsupervised 4D Reconstruction
CV and Pattern Recognition
Builds 3D worlds from videos in seconds.
DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images
CV and Pattern Recognition
Lets self-driving cars see and remember 3D scenes.
PAGE-4D: Disentangled Pose and Geometry Estimation for 4D Perception
CV and Pattern Recognition
Helps cameras understand moving things in 3D.