🤖 AI Summary
To address the limited capability of vision-language models (VLMs) in video spatial cognition tasks—such as 3D metadata querying and complex spatiotemporal reasoning—this paper introduces ViCA, the first VLM family specialized for video spatial reasoning. We construct two novel datasets: ViCA-322K, a video QA dataset grounded in real indoor scenes, and ViCA-Thinking-2.68K, an explainable chain-of-thought (CoT) dataset with annotated spatial reasoning traces. Leveraging these, we apply supervised fine-tuning to a 7B-parameter VLM via 3D metadata alignment, multi-step spatial reasoning modeling, and CoT distillation, yielding ViCA-7B and its reasoning-enhanced variant, ViCA-7B-Thinking. On all eight tasks of the VSI-Bench benchmark, ViCA achieves new state-of-the-art performance, improving Absolute Distance accuracy by +26.1 points. Crucially, ViCA exhibits interpretable, step-by-step spatial reasoning. Code, datasets, and models are publicly released.
📝 Abstract
Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). This paper makes two key contributions. First, we introduce ViCA (Visuospatial Cognitive Assistant)-322K, a diverse dataset of 322,003 QA pairs from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), offering supervision for 3D metadata-grounded queries and video-based complex reasoning. Second, we develop ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art results on all eight VSI-Bench tasks, outperforming existing models, including larger ones (e.g., +26.1 on Absolute Distance). For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B to create ViCA-7B-Thinking, a model that articulates its spatial reasoning. Our work highlights the importance of targeted data and suggests paths for improved temporal-spatial modeling. We release all resources to foster research in robust visuospatial intelligence.