Distilling a speech and music encoder with task arithmetic

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current self-supervised models treat speech and music representation learning separately, hindering unified audio understanding (e.g., audio large language models), while end-to-end joint training incurs prohibitive computational costs. To address this, we propose a unified encoder construction method combining task vector distillation with linear interpolation. We introduce a task vector distillation paradigm that decouples domain-specific knowledge from pre-trained speech (wav2vec 2.0) and music (MusicBERT) models. By linearly combining their distilled task vectors with adjustable weights, our method balances speech- versus music-oriented representation preferences without requiring joint fine-tuning, significantly reducing training overhead. Evaluated across multiple benchmarks, our model achieves better cross-domain generalization than conventional ensemble distillation, improving both representational quality and architectural flexibility for unified audio modeling.

📝 Abstract
Despite the progress in self-supervised learning (SSL) for speech and music, existing models treat these domains separately, limiting their capacity for unified audio understanding. A unified model is desirable for applications that require general representations, e.g. audio large language models. Nonetheless, directly training a general model for speech and music is computationally expensive. Knowledge Distillation of teacher ensembles may be a natural solution, but we posit that decoupling the distillation of the speech and music SSL models allows for more flexibility. Thus, we propose to learn distilled task vectors and then linearly interpolate them to form a unified speech+music model. This strategy enables flexible domain emphasis through adjustable weights and is also simpler to train. Experiments on speech and music benchmarks demonstrate that our method yields superior overall performance compared to ensemble distillation.
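The merging strategy the abstract describes can be sketched in a few lines: each distilled model defines a task vector (its weights minus the shared base weights), and the unified model is the base plus a weighted sum of those vectors. The sketch below uses toy one-parameter "models" and illustrative names (`task_vector`, `merge`); it is not the paper's implementation, only the arithmetic it relies on.

```python
# Hedged sketch of task arithmetic for merging two distilled encoders.
# All parameter names and values are illustrative, not taken from the paper.

def task_vector(distilled, base):
    """Task vector: element-wise difference between distilled and base weights."""
    return {k: distilled[k] - base[k] for k in base}

def merge(base, tau_speech, tau_music, w_speech, w_music):
    """Unified model: base weights plus a weighted sum of task vectors."""
    return {k: base[k] + w_speech * tau_speech[k] + w_music * tau_music[k]
            for k in base}

# Toy single-parameter "state dicts" standing in for full encoders.
base = {"layer.weight": 1.0}
speech_distilled = {"layer.weight": 1.5}   # distilled from a speech teacher
music_distilled = {"layer.weight": 0.75}   # distilled from a music teacher

tau_s = task_vector(speech_distilled, base)  # {"layer.weight": 0.5}
tau_m = task_vector(music_distilled, base)   # {"layer.weight": -0.25}

# Equal emphasis on both domains; shift the weights to bias toward one.
unified = merge(base, tau_s, tau_m, 0.5, 0.5)
print(unified["layer.weight"])  # 1.0 + 0.25 - 0.125 = 1.125
```

Because the weights are free parameters at merge time, a single pair of distilled models can yield a whole family of unified encoders (speech-leaning, music-leaning, or balanced) without any retraining.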
Problem

Research questions and friction points this paper is trying to address.

Unified audio understanding for speech and music
Computationally expensive general model training
Flexible domain emphasis with distilled task vectors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distilling task vectors for speech and music
Linearly interpolating vectors for unified model
Adjustable weights for flexible domain emphasis