🤖 AI Summary
Existing self-supervised speech language identification (LID) models struggle to treat dialects and accents of the same language as a single class. This paper proposes the first geography-aware self-supervised LID framework: it introduces language-level geographic location prediction as an auxiliary task and injects the predicted geolocation vector into intermediate representations as a conditioning signal, guiding the model toward more robust, unified language representations. The method is formulated within a multi-task self-supervised learning paradigm and evaluated on six multilingual datasets. It achieves state-of-the-art 97.7% accuracy on the FLEURS benchmark and a 9.7% relative improvement on the ML-SUPERB 2.0 dialect benchmark. The core contribution is the first integration of geographic priors into self-supervised LID, enhancing dialect and accent invariance through conditional representation learning.
📝 Abstract
While Self-supervised Learning (SSL) has significantly improved Spoken Language Identification (LID), existing models often struggle to consistently classify dialects and accents of the same language as a unified class. To address this challenge, we propose geolocation-aware LID, a novel approach that incorporates language-level geolocation information into the SSL-based LID model. Specifically, we introduce geolocation prediction as an auxiliary task and inject the predicted vectors into intermediate representations as conditioning signals. This explicit conditioning encourages the model to learn more unified representations for dialectal and accented variations. Experiments across six multilingual datasets demonstrate that our approach improves robustness to intra-language variations and unseen domains, achieving new state-of-the-art accuracy on FLEURS (97.7%) and a 9.7% relative improvement on the ML-SUPERB 2.0 dialect set.
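To make the conditioning idea concrete, the sketch below shows one plausible reading of the pipeline: an auxiliary head predicts a geolocation vector from pooled SSL features, and that prediction is projected back and added to the intermediate frame representations before the LID classifier. This is a minimal illustrative sketch, not the authors' implementation: the dimensions, the additive injection, the mean pooling, and all weight matrices (`W_geo`, `W_cond`, `W_lid`) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): T frames of D-dim SSL
# features, a 2-D geolocation target (e.g. latitude/longitude), N_LANG classes.
T, D, GEO_DIM, N_LANG = 50, 16, 2, 8

# Toy random weights standing in for trained layers.
W_geo = rng.standard_normal((D, GEO_DIM)) * 0.1   # auxiliary geolocation head
W_cond = rng.standard_normal((GEO_DIM, D)) * 0.1  # projects geo vector back to D
W_lid = rng.standard_normal((D, N_LANG)) * 0.1    # language-ID classifier head

def forward(frames):
    """Geolocation-conditioned LID forward pass (illustrative sketch)."""
    pooled = frames.mean(axis=0)             # (D,) utterance-level summary
    geo_pred = pooled @ W_geo                # (GEO_DIM,) auxiliary prediction
    # Inject the predicted geolocation vector into the intermediate
    # representation as an additive conditioning signal (broadcast over T).
    conditioned = frames + geo_pred @ W_cond
    logits = conditioned.mean(axis=0) @ W_lid
    probs = np.exp(logits - logits.max())    # numerically stable softmax
    return geo_pred, probs / probs.sum()

frames = rng.standard_normal((T, D))
geo_pred, lid_probs = forward(frames)
print(geo_pred.shape, lid_probs.shape)
```

In a multi-task setup, training would combine a regression loss on `geo_pred` against the language-level geolocation target with the usual LID classification loss on `lid_probs`; the specific loss weighting is left open here.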