GVSynergy-Det: Synergistic Gaussian-Voxel Representations for Multi-View 3D Object Detection
By: Yi Zhang, Yi Wang, Lei Yao and more
Potential Business Impact:
Finds objects in 3D using only pictures.
Image-based 3D object detection aims to identify and localize objects in 3D space using only RGB images, eliminating the need for expensive depth sensors required by point cloud-based methods. Existing image-based approaches face two critical challenges: methods achieving high accuracy typically require dense 3D supervision, while those operating without such supervision struggle to extract accurate geometry from images alone. In this paper, we present GVSynergy-Det, a novel framework that enhances 3D detection through synergistic Gaussian-Voxel representation learning. Our key insight is that continuous Gaussian and discrete voxel representations capture complementary geometric information: Gaussians excel at modeling fine-grained surface details while voxels provide structured spatial context. We introduce a dual-representation architecture that: 1) adapts generalizable Gaussian Splatting to extract complementary geometric features for detection tasks, and 2) develops a cross-representation enhancement mechanism that enriches voxel features with geometric details from Gaussian fields. Unlike previous methods that either rely on time-consuming per-scene optimization or utilize Gaussian representations solely for depth regularization, our synergistic strategy directly leverages features from both representations through learnable integration, enabling more accurate object localization. Extensive experiments demonstrate that GVSynergy-Det achieves state-of-the-art results on challenging indoor benchmarks, significantly outperforming existing methods on both ScanNetV2 and ARKitScenes datasets, all without requiring any depth or dense 3D geometry supervision (e.g., point clouds or TSDF).
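The abstract does not spell out how Gaussian features are merged into the voxel grid, so the following is only an illustrative sketch of one way a cross-representation enhancement could look, not the authors' implementation. It assumes each Gaussian carries a 3D center and a feature vector, averages the features of the Gaussians falling into each voxel, and adds the result to the voxel features through a sigmoid gate. The function names (`splat_gaussians_to_voxels`, `gated_fusion`) and the gate parameters (`w_gate`, `b_gate`) are hypothetical; in a real model the gate weights would be learned.

```python
import numpy as np


def splat_gaussians_to_voxels(centers, feats, grid_min, voxel_size, grid_shape):
    """Average the features of all Gaussians whose centers fall in each voxel.

    centers: (N, 3) Gaussian centers, feats: (N, C) per-Gaussian features.
    Returns a (X, Y, Z, C) feature grid and a (X, Y, Z) count grid.
    """
    grid_shape = tuple(grid_shape)
    # Integer voxel index of every Gaussian center.
    idx = np.floor((centers - grid_min) / voxel_size).astype(np.int64)
    # Drop Gaussians that lie outside the grid (assumption: they are ignored).
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx, feats = idx[inside], feats[inside]

    flat = np.ravel_multi_index(tuple(idx.T), grid_shape)
    n_vox = int(np.prod(grid_shape))
    summed = np.zeros((n_vox, feats.shape[1]), dtype=np.float64)
    counts = np.zeros(n_vox, dtype=np.float64)
    # Unbuffered accumulation so several Gaussians can share one voxel.
    np.add.at(summed, flat, feats)
    np.add.at(counts, flat, 1.0)

    mean = summed / np.maximum(counts, 1.0)[:, None]
    return mean.reshape(*grid_shape, -1), counts.reshape(grid_shape)


def gated_fusion(voxel_feats, gauss_feats, counts, w_gate, b_gate):
    """Enrich voxel features with Gaussian features through a sigmoid gate.

    w_gate: (2C, C) and b_gate: (C,) are hypothetical gate parameters.
    Voxels that received no Gaussian are left unchanged.
    """
    both = np.concatenate([voxel_feats, gauss_feats], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(both @ w_gate + b_gate)))
    occupied = (counts > 0)[..., None]
    return voxel_feats + occupied * gate * gauss_feats


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = rng.uniform(0.0, 4.0, size=(100, 3))   # toy Gaussian centers
    feats = rng.normal(size=(100, 8))                # toy Gaussian features
    voxels = rng.normal(size=(4, 4, 4, 8))           # toy voxel features
    g, n = splat_gaussians_to_voxels(centers, feats, np.zeros(3), 1.0, (4, 4, 4))
    fused = gated_fusion(voxels, g, n, rng.normal(size=(16, 8)), np.zeros(8))
    print(fused.shape)
```

The residual form keeps the voxel branch intact wherever no Gaussian is present, which is one simple way to let the structured grid provide context while the Gaussians contribute local detail.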
Similar Papers
Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes
CV and Pattern Recognition
Makes 3D pictures look real in messy places.
Automated 3D-GS Registration and Fusion via Skeleton Alignment and Gaussian-Adaptive Features
CV and Pattern Recognition
Combines 3D scenes perfectly for robots.
C3G: Learning Compact 3D Representations with 2K Gaussians
CV and Pattern Recognition
Builds detailed 3D worlds from few pictures.