MMS-VPR: Multimodal Street-Level Visual Place Recognition Dataset and Benchmark

📅 2025-05-18
🤖 AI Summary
Existing VPR datasets predominantly rely on vehicle-mounted imagery, exhibit limited multimodal diversity, and underrepresent non-Western, densely populated pedestrian streets. To address this, we propose MMS-VPR, the first street-level multimodal VPR benchmark tailored for complex pedestrian environments. It covers a ~70,800 m² open-air commercial district in Chengdu, comprising 207 locations, 78,575 images, and 2,512 video clips with synchronized GPS coordinates, timestamps, and textual metadata; a spatial graph (81 nodes, 125 edges) is also constructed. The benchmark fuses visual, spatiotemporal, textual, and graph-structural modalities; introduces a low-barrier, reproducible data collection protocol; and defines two evaluation subsets, Dataset_Edges and Dataset_Points, for fine-grained and graph-aware assessment. Experiments demonstrate substantial recall improvements over conventional VPR models, GNNs, and multimodal baselines, validating the role of structural priors and multimodal synergy in recognizing dense, non-Western urban scenes. The dataset is publicly released.

📝 Abstract
Existing visual place recognition (VPR) datasets predominantly rely on vehicle-mounted imagery, lack multimodal diversity, and underrepresent dense, mixed-use street-level spaces, especially in non-Western urban contexts. To address these gaps, we introduce MMS-VPR, a large-scale multimodal dataset for street-level place recognition in complex, pedestrian-only environments. The dataset comprises 78,575 annotated images and 2,512 video clips captured across 207 locations in a ~70,800 m² open-air commercial district in Chengdu, China. Each image is labeled with precise GPS coordinates, a timestamp, and textual metadata, and the collection covers varied lighting conditions, viewpoints, and timeframes. MMS-VPR follows a systematic and replicable data collection protocol with minimal device requirements, lowering the barrier for scalable dataset creation. Importantly, the dataset forms an inherent spatial graph with 125 edges, 81 nodes, and 1 subgraph, enabling structure-aware place recognition. We further define two application-specific subsets, Dataset_Edges and Dataset_Points, to support fine-grained and graph-based evaluation tasks. Extensive benchmarks using conventional VPR models, graph neural networks, and multimodal baselines show substantial improvements when leveraging multimodal and structural cues. MMS-VPR facilitates future research at the intersection of computer vision, geospatial understanding, and multimodal reasoning. The dataset is publicly available at https://huggingface.co/datasets/Yiwei-Ou/MMS-VPR.
Problem

Research questions and friction points this paper is trying to address.

Lack of multimodal diversity in existing VPR datasets
Underrepresentation of dense, mixed-use street-level spaces
Need for scalable dataset creation in non-Western urban contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal dataset for street-level place recognition
Systematic data collection with minimal device requirements
Inherent spatial graph enabling structure-aware recognition
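The inherent spatial graph behind structure-aware recognition can be sketched as a plain adjacency list: places are nodes (as in Dataset_Points) and the street segments connecting them are edges (as in Dataset_Edges), and neighbor lookups act as a structural prior for re-ranking visual retrieval candidates. This is a minimal illustrative sketch only; the node identifiers and the `SpatialGraph` class are hypothetical, not the dataset's actual labels or API.

```python
from collections import defaultdict

# Minimal sketch of a street-network spatial graph: nodes are places,
# edges are street segments. IDs like "P1" are hypothetical placeholders,
# not MMS-VPR's real labels.
class SpatialGraph:
    def __init__(self):
        self.adj = defaultdict(set)  # node -> set of neighboring nodes

    def add_edge(self, u, v):
        # Undirected street segment between two place nodes.
        self.adj[u].add(v)
        self.adj[v].add(u)

    def num_nodes(self):
        return len(self.adj)

    def num_edges(self):
        # Each undirected edge is stored twice in the adjacency lists.
        return sum(len(nbrs) for nbrs in self.adj.values()) // 2

    def neighbors(self, u):
        # Structural prior: only places adjacent to u are plausible
        # next matches, which can re-rank a visual retrieval shortlist.
        return sorted(self.adj[u])

# Toy example: a small corner of a street network.
g = SpatialGraph()
for u, v in [("P1", "P2"), ("P2", "P3"), ("P2", "P4"), ("P3", "P4")]:
    g.add_edge(u, v)

print(g.num_nodes())      # 4
print(g.num_edges())      # 4
print(g.neighbors("P2"))  # ['P1', 'P3', 'P4']
```

In the actual benchmark this graph has 81 nodes and 125 edges; graph neural network baselines consume the same structure as message-passing topology rather than a lookup table.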