City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning

📅 2025-07-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current large vision-language models (LVLMs) struggle with large-scale outdoor scene understanding due to architectural limitations—namely, reliance on single-view 2D inputs and insufficient support for multi-view (e.g., bird’s-eye and ground-level), multi-modal (e.g., images and point clouds), and cross-scale outdoor perception data. To address this, we introduce SVM-City, the first outdoor-scene benchmark enabling instruction-tuned learning across multiple scales, views, and modalities. We further propose an incomplete multimodal learning framework that achieves robust 2D/3D heterogeneous data fusion via a shared probabilistic latent space, eliminating dependence on modality completeness inherent in conventional concatenation-based methods. Our model incorporates a cross-modal alignment encoder and a probabilistic fusion mechanism. Evaluated on three representative outdoor tasks, it achieves an average 18.14% improvement in question-answering performance over state-of-the-art LVLMs, demonstrating significantly enhanced generalization and deep semantic understanding of urban environments.

📝 Abstract
Scene understanding enables intelligent agents to interpret and comprehend their environment. While existing large vision-language models (LVLMs) for scene understanding have primarily focused on indoor household tasks, they face two significant limitations when applied to outdoor large-scale scene understanding. First, outdoor scenarios typically encompass larger-scale environments observed through various sensors from multiple viewpoints (e.g., bird's-eye and ground-level views), whereas existing indoor LVLMs mainly analyze single visual modalities within building-scale contexts from humanoid viewpoints. Second, existing LVLMs lack multidomain perception outdoor data and struggle to effectively integrate 2D and 3D visual information. To address these limitations, we build the first multidomain perception outdoor scene understanding dataset, named SVM-City, derived from multi-Scale scenarios with multi-View and multi-Modal instruction-tuning data. It contains 420k images and 4,811M point clouds with 567k question-answering pairs collected from vehicles, low-altitude drones, high-altitude aerial planes, and satellites. To effectively fuse multimodal data when one modality is absent, we introduce incomplete multimodal learning to model outdoor scene understanding and design an LVLM named City-VLM. Multimodal fusion is realized by constructing a joint probabilistic distribution space rather than applying direct explicit fusion operations (e.g., concatenation). Experimental results on three typical outdoor scene understanding tasks show that City-VLM surpasses existing LVLMs on question-answering tasks by 18.14% on average. Our method demonstrates practical and generalizable performance across multiple outdoor scenes.
Problem

Research questions and friction points this paper is trying to address.

Outdoor scene understanding lacks integrated multidomain perception data
Existing LVLMs fail at large-scale 2D and 3D visual fusion
Multi-view outdoor scenarios are underserved by models compared to indoor settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multidomain perception dataset SVM-City for outdoor scenes
Incomplete multimodal learning for missing data fusion
Joint probabilistic distribution space for multimodal fusion
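The paper's joint probabilistic fusion is described only at a high level here; one standard way to realize such a space is a product-of-experts combination of per-modality Gaussian latents, which remains well-defined when a modality (e.g., the point cloud) is missing. Below is a minimal sketch under that assumption — the function name, latent dimension, and example distributions are hypothetical, not taken from the paper:

```python
import numpy as np

def poe_fuse(mus, logvars):
    """Product-of-experts fusion of per-modality Gaussian latents.

    Each available modality contributes an expert N(mu_i, var_i); a
    missing modality is simply omitted from the lists, so the fused
    distribution is always defined -- no concatenation of fixed-size
    feature vectors is required.
    """
    # Precision-weighted combination:
    #   fused_var = 1 / sum_i(1 / var_i)
    #   fused_mu  = fused_var * sum_i(mu_i / var_i)
    precisions = [np.exp(-lv) for lv in logvars]
    fused_var = 1.0 / sum(precisions)
    fused_mu = fused_var * sum(m * p for m, p in zip(mus, precisions))
    return fused_mu, np.log(fused_var)

# Hypothetical 4-d latents from an image encoder and a point-cloud encoder
mu_img, lv_img = np.zeros(4), np.zeros(4)   # expert N(0, 1)
mu_pcd, lv_pcd = np.ones(4), np.zeros(4)    # expert N(1, 1)

# Both modalities present: fused mean lies between the experts
mu_both, lv_both = poe_fuse([mu_img, mu_pcd], [lv_img, lv_pcd])

# Point cloud missing: fusion degrades gracefully to the image expert
mu_img_only, lv_img_only = poe_fuse([mu_img], [lv_img])
```

With two unit-variance experts at means 0 and 1, the fused mean is 0.5 with halved variance; dropping one expert simply returns the other, illustrating how an incomplete-modality setting is handled without special-case code.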
Penglei Sun
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Yaoxian Song
Zhejiang University, Hangzhou, China
Xiangru Zhu
Fudan University
cross-modal alignment, multi-modal understanding, multi-modal generation
Xiang Liu
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Qiang Wang
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Yue Liu
Terminus Technologies Co., Ltd., Chongqing, China
Changqun Xia
Associate Professor @ Peng Cheng Laboratory
Computer Vision and Pattern Recognition
Tiefeng Li
Zhejiang University
solid mechanics, soft matter, soft robotics, smart material and structures
Yang Yang
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Xiaowen Chu
IEEE Fellow, Professor, Data Science and Analytics, HKUST(GZ)
GPU Computing, Machine Learning Systems, Parallel and Distributed Computing, Wireless Networks