🤖 AI Summary
Current audio-language models struggle to jointly model speech, general audio events, and music within a unified architecture; moreover, reliance solely on cross-entropy loss leads to weak cross-modal alignment, as it neglects audio feature redundancy. To address these limitations, we propose U-SAM, a Unified Speech-Audio-Music language model. First, U-SAM introduces a Mixture-of-Experts-based dynamic routing mechanism that selectively integrates specialized encoders for speech, audio events, and music. Second, it employs a semantic-aware contrastive loss that explicitly identifies and suppresses redundant audio representations, thereby enhancing fine-grained audio-text alignment. Evaluated across diverse benchmarks, including automatic speech recognition (ASR), audio classification, and music understanding, U-SAM consistently outperforms both task-specific models and existing audio-language models. Furthermore, it demonstrates strong generalization and emergent zero-shot task performance.
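To make the routing idea concrete, here is a minimal, hypothetical sketch of an MoE projector that gates frame-level features across three domain "experts" (speech, audio events, music). It is a toy numpy illustration of the general mechanism, not U-SAM's actual architecture; all dimensions and the linear-expert design are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoEProjector:
    """Toy MoE projector: routes frame features through three
    domain experts (speech / audio events / music) with
    input-dependent gating weights. Illustrative only."""
    def __init__(self, d_in, d_out, n_experts=3):
        # one linear "expert" per domain (random init for the sketch)
        self.experts = [rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
                        for _ in range(n_experts)]
        self.gate = rng.standard_normal((d_in, n_experts)) / np.sqrt(d_in)

    def __call__(self, x):
        # x: (T, d_in) frame-level encoder features
        w = softmax(x @ self.gate)                      # (T, n_experts), rows sum to 1
        proj = np.stack([x @ E for E in self.experts])  # (n_experts, T, d_out)
        # per-frame weighted combination of expert outputs
        return np.einsum('te,etd->td', w, proj)

proj = MoEProjector(d_in=8, d_out=4)
feats = rng.standard_normal((10, 8))
out = proj(feats)
print(out.shape)  # (10, 4)
```

In the real model the gating would be trained jointly with the LLM so that, e.g., music-like inputs weight the music encoder's features more heavily.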
📝 Abstract
The text generation paradigm for audio tasks has opened new possibilities for unified audio understanding. However, existing models face significant challenges in achieving comprehensive understanding across diverse audio types such as speech, general audio events, and music. Furthermore, their exclusive reliance on cross-entropy loss for alignment often falls short: it treats all tokens equally and fails to account for redundant audio features, leading to weaker cross-modal alignment. To address these challenges, this paper introduces U-SAM, an advanced audio language model that integrates specialized encoders for speech, audio, and music with a pre-trained large language model (LLM). U-SAM employs a Mixture of Experts (MoE) projector for task-aware feature fusion, dynamically routing and integrating the domain-specific encoder outputs. Additionally, U-SAM incorporates a Semantic-Aware Contrastive Loss Module, which explicitly identifies redundant audio features under language supervision and rectifies their semantic and spectral representations to enhance cross-modal alignment. Extensive experiments demonstrate that U-SAM consistently outperforms both specialized models and existing audio language models across multiple benchmarks. Moreover, it exhibits emergent capabilities on unseen tasks, showcasing its generalization potential. Code is available at https://github.com/Honee-W/U-SAM/.
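For readers unfamiliar with contrastive audio-text alignment, the sketch below shows a generic symmetric InfoNCE-style loss over paired audio and text embeddings. This is a standard stand-in for the alignment objective, not U-SAM's Semantic-Aware Contrastive Loss itself; the redundancy-identification and rectification terms described in the paper are omitted, and the batch size and temperature are assumptions.

```python
import numpy as np

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (audio, text)
    embeddings: matched pairs sit on the diagonal of the
    similarity matrix and are pulled together, mismatches apart."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (a @ t.T) / temperature        # (B, B) cosine similarities
    labels = np.arange(len(a))              # i-th audio matches i-th text

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of audio->text and text->audio directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(1)
audio = rng.standard_normal((4, 16))
text = audio + 0.1 * rng.standard_normal((4, 16))  # roughly aligned pairs
print(contrastive_loss(audio, text))
```

U-SAM's module additionally uses the language supervision signal to flag redundant audio features before this kind of alignment is applied, which is what sharpens the fine-grained correspondence.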