🤖 AI Summary
To address the limited capability of vision-language models (VLMs) in video spatial cognition tasks—such as 3D metadata querying and complex spatiotemporal reasoning—this paper introduces ViCA, the first VLM family specialized for video spatial reasoning. We construct two novel datasets: ViCA-322K, a video QA dataset grounded in real indoor scenes, and ViCA-Thinking-2.68K, an explainable chain-of-thought (CoT) dataset with annotated spatial reasoning traces. Leveraging these, we apply supervised fine-tuning to a 7B-parameter VLM via 3D metadata alignment, multi-step spatial reasoning modeling, and CoT distillation, yielding ViCA-7B and its reasoning-enhanced variant, ViCA-7B-Thinking. On all eight tasks of the VSI-Bench benchmark, ViCA achieves new state-of-the-art performance, improving Absolute Distance accuracy by +26.1 points. Crucially, ViCA exhibits interpretable, step-by-step spatial reasoning. Code, datasets, and models are publicly released.
📝 Abstract
Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). This paper makes two key contributions. First, we introduce ViCA (Visuospatial Cognitive Assistant)-322K, a diverse dataset of 322,003 QA pairs from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), offering supervision for 3D metadata-grounded queries and video-based complex reasoning. Second, we develop ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art results on all eight VSI-Bench tasks, outperforming existing models, including larger ones (e.g., +26.1 on Absolute Distance). For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B to create ViCA-7B-Thinking, a model that articulates its spatial reasoning. Our work highlights the importance of targeted data and suggests paths for improved temporal-spatial modeling. We release all resources to foster research in robust visuospatial intelligence.