🤖 AI Summary
This study addresses the scarcity of high-quality audio-location paired data that has hindered research in audio geo-localization. To this end, we introduce AGL1K, the first audio geo-localization benchmark for audio language models (ALMs), spanning 72 countries and territories, and propose a novel metric, “audio localizability,” which is used to curate 1,444 high-quality audio clips for evaluation. Leveraging this benchmark, we conduct a systematic analysis of 16 ALMs, examining their regional biases, reasoning traces, and reliance on linguistic cues. Our findings reveal that closed-source models substantially outperform their open-source counterparts and that predictions are often driven by language-related cues. AGL1K thus provides a foundation for evaluating and advancing the geospatial reasoning capabilities of audio language models.
📝 Abstract
Geo-localization aims to infer the geographic origin of a given signal. In computer vision, geo-localization has served as a demanding benchmark for compositional reasoning and is relevant to public safety. In contrast, progress on audio geo-localization has been constrained by the lack of high-quality audio-location pairs. To address this gap, we introduce AGL1K, the first audio geo-localization benchmark for audio language models (ALMs), spanning 72 countries and territories. To extract reliably localizable samples from a crowd-sourced platform, we propose the Audio Localizability metric, which quantifies the informativeness of each recording, yielding 1,444 curated audio clips. Evaluations of 16 ALMs show that audio geo-localization capability has emerged in these models. We find that closed-source models substantially outperform open-source models, and that linguistic cues often serve as the dominant scaffold for prediction. We further analyze ALMs' reasoning traces, regional bias, error causes, and the interpretability of the localizability metric. Overall, AGL1K establishes a benchmark for audio geo-localization and may help advance the geospatial reasoning capability of ALMs.