VIM-GS: Visual-Inertial Monocular Gaussian Splatting via Object-level Guidance in Large Scenes
By: Shengkai Zhang, Yuhe Liu, Guanjun Wu and more
Potential Business Impact:
Creates realistic 3D views of large scenes from ordinary single-camera images.
VIM-GS is a Gaussian Splatting (GS) framework that uses monocular images for novel-view synthesis (NVS) in large scenes. GS typically requires accurate depth, obtained from RGB-D or stereo cameras, to initialize Gaussian ellipsoids. The limited depth-sensing range of these cameras makes it difficult for GS to work in large scenes. Monocular images, however, lack depth to guide the learning, which leads to inferior NVS results. Although large foundation models (LFMs) for monocular depth estimation are available, they suffer from cross-frame inconsistency, inaccuracy for distant scenes, and ambiguity under deceptive texture cues. This paper aims to generate dense, accurate depth images from monocular RGB inputs for high-definition GS rendering. The key idea is to leverage the accurate but sparse depth from visual-inertial Structure-from-Motion (SfM) to refine the dense but coarse depth from LFMs. To bridge the sparse input and dense output, we propose an object-segmented depth propagation algorithm that renders the depth of pixels of structured objects. We then develop a dynamic depth refinement module to handle the degraded SfM depth of dynamic objects and refine the coarse LFM depth. Experiments on public and customized datasets demonstrate the superior rendering quality of VIM-GS in large scenes.
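To make the key idea concrete, the following is a minimal sketch of one common way to refine dense but coarse depth with sparse but accurate depth: fit a per-segment scale and shift by least squares, then apply it to every pixel of that segment. This is not the authors' object-segmented propagation algorithm, which the abstract does not specify. The function names (`fit_scale_shift`, `refine_depth`), the flat per-pixel list layout, and the fallback rules are all assumptions made for illustration.

```python
from collections import defaultdict


def fit_scale_shift(coarse, sparse):
    """Least-squares fit of sparse ~= a * coarse + b; returns (a, b)."""
    n = len(coarse)
    if n == 0:
        return 1.0, 0.0  # nothing to fit: identity mapping
    mean_c = sum(coarse) / n
    mean_s = sum(sparse) / n
    var_c = sum((c - mean_c) ** 2 for c in coarse)
    if n < 2 or var_c == 0.0:
        # Assumption: with too few anchors for a slope, apply a pure shift.
        return 1.0, mean_s - mean_c
    cov = sum((c - mean_c) * (s - mean_s) for c, s in zip(coarse, sparse))
    a = cov / var_c
    return a, mean_s - a * mean_c


def refine_depth(coarse_depth, segment_ids, sparse_depth):
    """Align coarse depth to sparse anchors, one (scale, shift) per segment.

    coarse_depth: per-pixel coarse depth (hypothetical LFM output), flat list
    segment_ids:  per-pixel object label, same length as coarse_depth
    sparse_depth: dict mapping pixel index -> accurate depth (e.g. from SfM)
    """
    # Collect (coarse, accurate) depth pairs for each segment.
    samples = defaultdict(lambda: ([], []))
    for idx, depth in sparse_depth.items():
        coarse_vals, sparse_vals = samples[segment_ids[idx]]
        coarse_vals.append(coarse_depth[idx])
        sparse_vals.append(depth)

    params = {seg: fit_scale_shift(c, s) for seg, (c, s) in samples.items()}

    # Segments without any anchor keep their coarse depth unchanged.
    refined = []
    for c, seg in zip(coarse_depth, segment_ids):
        a, b = params.get(seg, (1.0, 0.0))
        refined.append(a * c + b)
    return refined


if __name__ == "__main__":
    coarse = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0]
    segments = [0, 0, 0, 0, 1, 1]
    anchors = {0: 3.0, 2: 7.0, 4: 5.0}
    print(refine_depth(coarse, segments, anchors))
```

Fitting per segment rather than globally reflects the abstract's point that LFM depth is inconsistent and unreliable at distance, so a single global correction may not suit every object. A practical system would also need to reject anchors on dynamic objects, which this sketch does not attempt.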
Similar Papers
LVD-GS: Gaussian Splatting SLAM for Dynamic Scenes via Hierarchical Explicit-Implicit Representation Collaboration Rendering
CV and Pattern Recognition
Maps moving outdoor places accurately, even with changing objects.
MS-GS: Multi-Appearance Sparse-View 3D Gaussian Splatting in the Wild
CV and Pattern Recognition
Makes 3D pictures from few photos.
Enhancing Novel View Synthesis from extremely sparse views with SfM-free 3D Gaussian Splatting Framework
CV and Pattern Recognition
Makes 3D pictures from few photos.