Cross-Modal Urban Sensing: Evaluating Sound-Vision Alignment Across Street-Level and Aerial Imagery

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates cross-modal correspondences between acoustic and visual scenes in urban environments to uncover ecological and socio-spatial information embedded in soundscapes. Conducting multi-source sensing experiments across London, New York, and Tokyo, we integrate street-level imagery, remote sensing data, and geotagged audio recordings. Methodologically, we systematically compare embedding-based visual representations (CLIP, RemoteCLIP) against segmentation-based representations (CLIPSeg, Seg-Earth OV) for acoustic-semantic alignment, interpreting ecological structure through a Biophony-Geophony-Anthrophony (BGA) framework. Results demonstrate that street-view embeddings better capture holistic acoustic scene semantics, whereas remote sensing-based segmentation yields more interpretable discrimination among the three eco-acoustic components: biophony, geophony, and anthrophony. This work establishes a novel multimodal urban sensing paradigm and provides methodological foundations for cross-modal environmental monitoring and interpretation.
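The embedding-based branch described above can be sketched with off-the-shelf models: an AST audio tagger supplies acoustic semantics (top AudioSet labels), which are embedded with CLIP's text encoder and compared by cosine similarity against the CLIP embedding of the co-located street-view image. This is a minimal illustration of one plausible alignment pipeline, not the paper's exact procedure; the Hugging Face checkpoints, the prompt template, and the file paths below are assumptions for the example.

```python
# Illustrative cross-modal alignment sketch (not the paper's exact pipeline).
# Assumptions: the checkpoints below, librosa for audio I/O, and placeholder
# paths for a geotagged recording and its paired street-view image.
import librosa
import torch
from PIL import Image
from transformers import (ASTFeatureExtractor, ASTForAudioClassification,
                          CLIPModel, CLIPProcessor)

AST_CKPT = "MIT/ast-finetuned-audioset-10-10-0.4593"
CLIP_CKPT = "openai/clip-vit-base-patch32"

# 1. Tag the audio clip with AST; AudioSet labels act as acoustic semantics.
ast_extractor = ASTFeatureExtractor.from_pretrained(AST_CKPT)
ast_model = ASTForAudioClassification.from_pretrained(AST_CKPT).eval()

waveform, sr = librosa.load("clip_0001.wav", sr=16000, mono=True)  # placeholder path
audio_inputs = ast_extractor(waveform, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = ast_model(**audio_inputs).logits[0]
top_ids = logits.topk(5).indices.tolist()
sound_labels = [ast_model.config.id2label[i] for i in top_ids]

# 2. Embed the paired street-view image and the sound labels with CLIP.
clip_model = CLIPModel.from_pretrained(CLIP_CKPT).eval()
clip_processor = CLIPProcessor.from_pretrained(CLIP_CKPT)

image = Image.open("streetview_0001.jpg")  # placeholder path
prompts = [f"a photo of a place where you hear {label.lower()}" for label in sound_labels]
clip_inputs = clip_processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = clip_model(**clip_inputs)

# 3. Cosine similarity between the image embedding and each sound-derived text embedding.
img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)  # (1, d)
txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)    # (5, d)
similarity = (txt @ img.T).squeeze(-1)
for label, score in zip(sound_labels, similarity.tolist()):
    print(f"{label:<35s} {score:+.3f}")
```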

📝 Abstract
Environmental soundscapes convey substantial ecological and social information regarding urban environments; however, their potential remains largely untapped in large-scale geographic analysis. In this study, we investigate the extent to which urban sounds correspond with visual scenes by comparing various visual representation strategies in capturing acoustic semantics. We employ a multimodal approach that integrates geo-referenced sound recordings with both street-level and remote sensing imagery across three major global cities: London, New York, and Tokyo. Utilizing the AST model for audio, along with CLIP and RemoteCLIP for imagery, as well as CLIPSeg and Seg-Earth OV for semantic segmentation, we extract embeddings and class-level features to evaluate cross-modal similarity. The results indicate that street view embeddings demonstrate stronger alignment with environmental sounds compared to segmentation outputs, whereas remote sensing segmentation is more effective in interpreting ecological categories through a Biophony-Geophony-Anthrophony (BGA) framework. These findings imply that embedding-based models offer superior semantic alignment, while segmentation-based methods provide interpretable links between visual structure and acoustic ecology. This work advances the burgeoning field of multimodal urban sensing by offering novel perspectives for incorporating sound into geospatial analysis.
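The segmentation branch of the abstract can be illustrated with CLIPSeg: open-vocabulary masks for a handful of scene classes are reduced to area fractions and grouped into the Biophony-Geophony-Anthrophony buckets. The prompt list, the class-to-BGA mapping, the checkpoint, and the image path below are illustrative assumptions, not the paper's taxonomy or data.

```python
# Segmentation-derived class-level features grouped into BGA components (illustrative sketch).
# The prompts, the class-to-BGA mapping, and the image path are assumptions for this example.
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

CKPT = "CIDAS/clipseg-rd64-refined"
processor = CLIPSegProcessor.from_pretrained(CKPT)
model = CLIPSegForImageSegmentation.from_pretrained(CKPT).eval()

# Illustrative scene classes and their grouping into biophony / geophony / anthrophony.
bga_map = {
    "trees": "biophony", "grass": "biophony",
    "water": "geophony", "sky": "geophony",
    "road": "anthrophony", "cars": "anthrophony", "buildings": "anthrophony",
}
prompts = list(bga_map)

image = Image.open("aerial_tile_0001.png").convert("RGB")  # placeholder path
inputs = processor(text=prompts, images=[image] * len(prompts),
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    masks = torch.sigmoid(model(**inputs).logits)  # (n_prompts, H, W)

# Soft per-class area fraction, then aggregation into the three BGA components.
areas = masks.mean(dim=(1, 2))
bga_scores = {"biophony": 0.0, "geophony": 0.0, "anthrophony": 0.0}
for cls, area in zip(prompts, areas.tolist()):
    bga_scores[bga_map[cls]] += area
total = sum(bga_scores.values()) or 1.0
for component, score in bga_scores.items():
    print(f"{component:<12s} {score / total:.2%}")
```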
Problem

Research questions and friction points this paper is trying to address.

Evaluating sound-vision alignment in urban environments
Comparing visual representation strategies for acoustic semantics
Integrating sound and imagery for geospatial analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal approach integrates sound and imagery
Uses AST, CLIP, RemoteCLIP for cross-modal analysis
Embedding-based models enhance semantic sound-vision alignment (see the retrieval-style sketch after this list)
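One generic way to quantify whether a visual representation aligns with sound is a retrieval-style check over paired samples: given a matrix of audio-to-image similarities, count how often the true co-located image ranks in the top k for its recording. The helper below is a sketch under that assumption, not the paper's reported metric; the random matrices stand in for scores produced by the embedding and segmentation pipelines, with ground-truth pairs placed on the diagonal.

```python
# Retrieval-style alignment check (illustrative): how often does an audio clip's
# similarity row rank its true co-located image (same index) in the top k?
# The random similarity matrices below are placeholders for real pipeline outputs.
import numpy as np

def recall_at_k(sim: np.ndarray, k: int = 5) -> float:
    """Fraction of audio clips whose paired image (index i) appears among the top-k images."""
    ranks = (-sim).argsort(axis=1)  # image indices sorted by descending similarity per clip
    hits = [i in ranks[i, :k] for i in range(sim.shape[0])]
    return float(np.mean(hits))

rng = np.random.default_rng(0)
n = 200
sim_embedding = rng.normal(size=(n, n)) + 2.0 * np.eye(n)     # stand-in: embedding pipeline
sim_segmentation = rng.normal(size=(n, n)) + 1.0 * np.eye(n)  # stand-in: segmentation pipeline

for name, sim in [("embedding", sim_embedding), ("segmentation", sim_segmentation)]:
    print(f"{name:<13s} R@1={recall_at_k(sim, 1):.2f}  R@5={recall_at_k(sim, 5):.2f}")
```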
Pengyu Chen
Department of Geography, University of South Carolina, Columbia, SC 29208, USA
Xiao Huang
Department of Environmental Sciences, Emory University, Atlanta, GA 30322, USA
Teng Fei
School of Resources and Environmental Science, Wuhan University
Remote Sensing, GIS, Social Sensing, Planning, Natural Resources
Sicheng Wang
Department of Geography, University of South Carolina, Columbia, SC 29208, USA