U-SAM: An Audio Language Model for Unified Speech, Audio, and Music Understanding

📅 2025-05-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current audio-language models struggle to jointly model speech, general audio events, and music within a unified architecture; moreover, reliance solely on cross-entropy loss leads to weak cross-modal alignment, as it neglects audio feature redundancy. To address these limitations, we propose U-SAM, a Unified Speech-Audio-Music language model. First, U-SAM introduces a novel Mixture-of-Experts-based dynamic routing mechanism that selectively integrates specialized encoders for speech, audio events, and music. Second, it employs a semantic-aware contrastive loss that explicitly identifies and suppresses redundant audio representations, thereby enhancing fine-grained audio–text alignment. Evaluated across diverse benchmarks—including automatic speech recognition (ASR), audio classification, and music understanding—U-SAM consistently outperforms both task-specific models and existing audio-language models. Furthermore, it demonstrates strong generalization capabilities and emergent zero-shot task performance.

📝 Abstract
The text generation paradigm for audio tasks has opened new possibilities for unified audio understanding. However, existing models face significant challenges in achieving a comprehensive understanding across diverse audio types, such as speech, general audio events, and music. Furthermore, their exclusive reliance on cross-entropy loss for alignment often falls short, as it treats all tokens equally and fails to account for redundant audio features, leading to weaker cross-modal alignment. To deal with the above challenges, this paper introduces U-SAM, an advanced audio language model that integrates specialized encoders for speech, audio, and music with a pre-trained large language model (LLM). U-SAM employs a Mixture of Experts (MoE) projector for task-aware feature fusion, dynamically routing and integrating the domain-specific encoder outputs. Additionally, U-SAM incorporates a Semantic-Aware Contrastive Loss Module, which explicitly identifies redundant audio features under language supervision and rectifies their semantic and spectral representations to enhance cross-modal alignment. Extensive experiments demonstrate that U-SAM consistently outperforms both specialized models and existing audio language models across multiple benchmarks. Moreover, it exhibits emergent capabilities on unseen tasks, showcasing its generalization potential. Code is available (https://github.com/Honee-W/U-SAM/).
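The abstract describes a Mixture-of-Experts (MoE) projector that dynamically routes and fuses the outputs of the speech, audio, and music encoders into the LLM embedding space. The paper's exact gating and fusion design is not given here, so the following is a loose numpy sketch under assumed details: each expert is a linear projection for one encoder, and a gate conditioned on pooled features produces soft routing weights. All names (`MoEProjector`, `enc_dims`, `llm_dim`) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoEProjector:
    """Toy MoE projector (illustrative, not U-SAM's actual implementation):
    each expert linearly maps one domain encoder's features into the LLM
    embedding space; a gate on the pooled features yields soft routing weights.
    Assumes the encoders emit the same number of frames T."""
    def __init__(self, enc_dims, llm_dim):
        self.experts = [rng.standard_normal((d, llm_dim)) * 0.02 for d in enc_dims]
        self.gate = rng.standard_normal((sum(enc_dims), len(enc_dims))) * 0.02

    def __call__(self, feats):
        # feats: list of (T, enc_dim_i) arrays from the speech/audio/music encoders
        pooled = np.concatenate([f.mean(axis=0) for f in feats])   # (sum(enc_dims),)
        weights = softmax(pooled @ self.gate)                      # (n_experts,)
        projected = [f @ W for f, W in zip(feats, self.experts)]   # each (T, llm_dim)
        fused = sum(w * p for w, p in zip(weights, projected))     # (T, llm_dim)
        return fused, weights
```

The soft weighting makes the fusion "task-aware" in the abstract's sense: inputs that activate, say, the music encoder more strongly shift the routing weights toward its expert.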
Problem

Research questions and friction points this paper is trying to address.

Achieving unified understanding across speech, audio, and music
Improving cross-modal alignment beyond cross-entropy loss limitations
Addressing redundant audio features for better semantic representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates specialized encoders with pre-trained LLM
Uses Mixture of Experts for dynamic feature fusion
Employs Semantic-Aware Contrastive Loss for alignment
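The last bullet refers to the Semantic-Aware Contrastive Loss Module, which the abstract says identifies redundant audio features under language supervision. The exact formulation is not reproduced on this page; below is one plausible numpy sketch: frame-level audio features are soft-weighted by their similarity to the paired text embedding (suppressing low-relevance frames) before a standard InfoNCE loss over the batch. The weighting scheme and temperature are assumptions, not the paper's method.

```python
import numpy as np

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def semantic_contrastive_loss(audio, text, tau=0.07):
    """Loose sketch of a semantic-aware contrastive loss (illustrative only).
    audio: (B, T, D) frame-level features; text: (B, D) sentence embeddings."""
    a, t = l2norm(audio), l2norm(text)
    # per-frame relevance to the paired caption (language supervision)
    rel = np.einsum('btd,bd->bt', a, t)                 # (B, T)
    w = np.exp(rel / tau)
    w = w / w.sum(axis=1, keepdims=True)                # soft mask over frames
    pooled = l2norm(np.einsum('bt,btd->bd', w, audio))  # redundancy-suppressed pooling
    logits = pooled @ t.T / tau                         # (B, B) audio-text similarities
    labels = np.arange(len(text))
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[labels, labels].mean()                 # InfoNCE over the batch
```

Used alongside cross-entropy, such a term gives the model a gradient signal that depends on which audio frames carry caption-relevant semantics, rather than treating all tokens equally.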
Ziqian Wang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, China
Xianjun Xia
ByteDance, China
Xinfa Zhu
Northwestern Polytechnical University
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Software, Northwestern Polytechnical University, China