Pseudo Depth Meets Gaussian: A Feed-forward RGB SLAM Baseline
By: Linqing Zhao, Xiuwei Xu, Yirui Wang, and more
Potential Business Impact:
Makes 3D models from videos much faster.
Plain English Summary
Imagine you're filming a video on your phone, and you want to create a perfect 3D model of the room you're in. This new method lets your phone build that 3D model in real-time as you film, without needing special cameras or lots of processing time. This means future apps could create incredibly detailed 3D scans of anything you point your phone at, making virtual tours or augmented reality experiences much more realistic and accessible.
Incrementally recovering real-sized 3D geometry from a pose-free RGB stream is a challenging task in 3D reconstruction, requiring minimal assumptions on input data. Existing methods can be broadly categorized into end-to-end and visual SLAM-based approaches, both of which either struggle with long sequences or depend on slow test-time optimization and depth sensors. To address this, we first integrate a depth estimator into an RGB-D SLAM system, but this approach is hindered by inaccurate geometric details in predicted depth. Through further investigation, we find that 3D Gaussian mapping can effectively solve this problem. Building on this, we propose an online 3D reconstruction method using 3D Gaussian-based SLAM, combined with a feed-forward recurrent prediction module to directly infer camera pose from optical flow. This approach replaces slow test-time optimization with fast network inference, significantly improving tracking speed. Additionally, we introduce a local graph rendering technique to enhance robustness in feed-forward pose prediction. Experimental results on the Replica and TUM-RGBD datasets, along with a real-world deployment demonstration, show that our method achieves performance on par with the state-of-the-art SplaTAM, while reducing tracking time by more than 90%.
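The speed-up described above comes from replacing per-frame test-time pose optimization with a feed-forward recurrent predictor that maps optical-flow features to a relative camera pose. The following is a minimal, hypothetical PyTorch sketch of that idea only; the module names, feature dimensions, and 6-DoF pose parameterization are illustrative assumptions, not the authors' released code.

# Hypothetical sketch: a recurrent network predicts relative camera pose from
# optical-flow features, standing in for slow test-time pose optimization.
# Feature sizes and the [tx, ty, tz, rx, ry, rz] parameterization are assumptions.
import torch
import torch.nn as nn

class FlowPosePredictor(nn.Module):
    """Predict a relative 6-DoF pose from per-frame optical-flow features."""
    def __init__(self, flow_feat_dim=256, hidden_dim=256):
        super().__init__()
        # Compress flow features before the recurrence.
        self.encoder = nn.Sequential(
            nn.Linear(flow_feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Recurrence carries state across the incoming RGB stream.
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Head outputs 3 translation + 3 axis-angle rotation components.
        self.pose_head = nn.Linear(hidden_dim, 6)

    def forward(self, flow_feats, hidden=None):
        # flow_feats: (B, T, flow_feat_dim) features of flow between consecutive frames.
        x = self.encoder(flow_feats)
        out, hidden = self.rnn(x, hidden)
        rel_pose = self.pose_head(out)  # (B, T, 6) relative poses per frame pair
        return rel_pose, hidden

if __name__ == "__main__":
    model = FlowPosePredictor()
    feats = torch.randn(1, 4, 256)      # dummy flow features for 4 frame pairs
    poses, state = model(feats)
    print(poses.shape)                  # torch.Size([1, 4, 6])

In a complete system, these predicted relative poses would feed the 3D Gaussian mapping step, which the abstract credits with correcting the inaccurate geometric details of predicted depth.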
Similar Papers
On-the-fly Large-scale 3D Reconstruction from Multi-Camera Rigs
CV and Pattern Recognition
Builds 3D worlds from many cameras fast.
Gaussian-Plus-SDF SLAM: High-fidelity 3D Reconstruction at 150+ fps
CV and Pattern Recognition
Makes 3D maps of rooms much faster.
ProDyG: Progressive Dynamic Scene Reconstruction via Gaussian Splatting from Monocular Videos
CV and Pattern Recognition
Builds 3D worlds from videos in real-time.