Visuospatial Cognitive Assistant
By: Qi Feng
Potential Business Impact:
Helps robots understand and move around in real-world spaces.
Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). This paper makes two key contributions. First, we introduce ViCA (Visuospatial Cognitive Assistant)-322K, a diverse dataset of 322,003 QA pairs drawn from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), providing supervision for both 3D metadata-grounded queries and complex video-based reasoning. Second, we develop ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art results on all eight VSI-Bench tasks, outperforming existing models, including larger ones (e.g., +26.1 on Absolute Distance). For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B on it to create ViCA-7B-Thinking, a model that articulates its spatial reasoning. Our work highlights the importance of targeted data and suggests paths toward improved spatio-temporal modeling. We release all resources to foster research in robust visuospatial intelligence.
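To make the dataset contribution concrete, below is a minimal sketch of what a ViCA-322K-style video QA record could look like and how one might read a file of such records. This is an illustrative assumption only: the field names ("video", "question", "answer", "task"), the task tag, and the file name are hypothetical and are not taken from the released dataset format.

# Minimal sketch (assumed format, not the authors' released schema) of a
# video QA record and a loader for a JSON Lines file of such records.
import json
from pathlib import Path

def load_qa_records(path: str):
    """Yield one QA record (a dict) per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

if __name__ == "__main__":
    # Hypothetical example of a 3D metadata-grounded distance question.
    example = {
        "video": "scannet/scene0000_00.mp4",   # hypothetical source clip
        "question": "How far apart are the sofa and the television, in meters?",
        "answer": "2.4",
        "task": "absolute_distance",           # hypothetical task tag
    }
    print(json.dumps(example, indent=2))
    # for record in load_qa_records("vica_322k.jsonl"):  # hypothetical file name
    #     ...

A loader like this would simply feed question-answer pairs into a standard VLM fine-tuning pipeline; the actual released resources should be consulted for the true schema and splits.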
Similar Papers
Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts
CV and Pattern Recognition
Helps computers understand where things are.
SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data
CV and Pattern Recognition
Teaches computers to understand where things are.
Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation
CV and Pattern Recognition
Teaches computers to understand space from videos.