Cross-Modal Urban Sensing: Evaluating Sound-Vision Alignment Across Street-Level and Aerial Imagery
By: Pengyu Chen, Xiao Huang, Teng Fei, and more
Potential Business Impact:
Maps cities by listening to their sounds.
Environmental soundscapes convey substantial ecological and social information about urban environments; however, their potential remains largely untapped in large-scale geographic analysis. In this study, we investigate the extent to which urban sounds correspond with visual scenes by comparing various visual representation strategies in capturing acoustic semantics. We employ a multimodal approach that integrates geo-referenced sound recordings with both street-level and remote sensing imagery across three major global cities: London, New York, and Tokyo. Utilizing the Audio Spectrogram Transformer (AST) for audio, CLIP and RemoteCLIP for imagery, and CLIPSeg and SegEarth-OV for semantic segmentation, we extract embeddings and class-level features to evaluate cross-modal similarity. The results indicate that street-view embeddings align more strongly with environmental sounds than segmentation outputs do, whereas remote sensing segmentation is more effective at interpreting ecological categories through a Biophony–Geophony–Anthrophony (BGA) framework. These findings imply that embedding-based models offer superior semantic alignment, while segmentation-based methods provide interpretable links between visual structure and acoustic ecology. This work advances the burgeoning field of multimodal urban sensing by offering novel perspectives for incorporating sound into geospatial analysis.
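The core comparison the abstract describes can be sketched as a cosine-similarity ranking between an audio embedding and a set of candidate image embeddings. This is a minimal, illustrative sketch, not the paper's actual pipeline: it assumes the AST and CLIP/RemoteCLIP embeddings have already been extracted and projected into a shared vector space, and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

def rank_images_by_sound(audio_emb: np.ndarray,
                         image_embs: list[np.ndarray]) -> list[int]:
    """Return image indices sorted by similarity to the audio embedding,
    most similar first. Assumes all embeddings share one vector space."""
    sims = [cosine_similarity(audio_emb, e) for e in image_embs]
    return list(np.argsort(sims)[::-1])

# Toy example: the first image embedding points the same way as the audio one.
audio = np.array([1.0, 0.0, 0.0])
images = [np.array([0.9, 0.1, 0.0]),   # visually matches the sound
          np.array([0.0, 1.0, 0.0])]   # unrelated scene
ranking = rank_images_by_sound(audio, images)
```

In practice, class-level features from the segmentation models (CLIPSeg, SegEarth-OV) could be compared the same way, with each class histogram treated as a vector.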
Similar Papers
Street-Level Geolocalization Using Multimodal Large Language Models and Retrieval-Augmented Generation
CV and Pattern Recognition
Find exact locations from photos.
Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos
Audio and Speech Processing
Helps computers understand sounds and sights together.
Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation
Sound
Creates pictures from sounds, even without matching pairs.