UniWhisper: Efficient Continual Multi-task Training for Robust Universal Audio Representation

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio encoders are typically specialized for a single domain—such as speech, environmental sounds, or music—and lack general-purpose representational capabilities. This work proposes UniWhisper, the first framework to introduce instruction tuning into universal audio representation learning. By unifying heterogeneous audio tasks into an instruction–response format and employing a continual multi-task training strategy, UniWhisper enables cross-domain joint modeling within a single Whisper encoder, without requiring task-specific heads or loss functions. Experimental results demonstrate that UniWhisper significantly outperforms the original Whisper across 20 diverse tasks, achieving MLP probe and kNN normalized weighted average scores of 0.81 and 0.61, respectively, compared to Whisper’s 0.64 and 0.46. Moreover, UniWhisper maintains strong speech performance while offering efficient inference and robust generalization.
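The unification described above can be sketched in a few lines: each task is rewritten as an instruction paired with a text answer, so a single decoder can be trained with plain next-token prediction. This is a minimal illustration only; the field names and instruction wordings are hypothetical, not the paper's actual schema.

```python
# Hedged sketch: casting heterogeneous audio tasks into one
# instruction-answer format. Field names and instruction strings are
# illustrative assumptions, not UniWhisper's exact templates.
def to_instruction_example(task, audio_id, answer):
    instructions = {
        "asr": "Transcribe the speech in the audio.",
        "sound_event": "Name the environmental sound in the audio.",
        "music_genre": "Identify the genre of the music clip.",
    }
    # Every task yields the same (audio, instruction, answer) triple,
    # so no task-specific heads or losses are needed downstream.
    return {
        "audio": audio_id,
        "instruction": instructions[task],
        "answer": answer,
    }

ex = to_instruction_example("sound_event", "clip_0042.wav", "dog bark")
print(ex["instruction"], "->", ex["answer"])
```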

📝 Abstract
A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task training framework that casts heterogeneous audio tasks into a unified instruction–answer format. This enables standard next-token training without task-specific heads or losses. We train it on 38k hours of public audio and assess the encoder using shallow MLP probes and k-nearest neighbors (kNN) on 20 tasks spanning speech, environmental sound, and music. UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper, while retaining strong speech performance.
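The kNN probe used in the evaluation above is a simple frozen-feature protocol: embeddings from the encoder are classified by majority vote among their nearest training embeddings, with no learned parameters. A minimal sketch, using synthetic vectors in place of real encoder features (the dimensions, k, and data are illustrative, not the paper's settings):

```python
# Hedged sketch of a kNN probe on frozen encoder embeddings.
# Synthetic Gaussian clusters stand in for real UniWhisper features.
from collections import Counter
import math
import random

def knn_predict(train_x, train_y, query, k=3):
    # Euclidean distance from the query to every training embedding,
    # then majority vote among the k nearest neighbors.
    dists = sorted((math.dist(query, x), y) for x, y in zip(train_x, train_y))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

random.seed(0)
# Two toy "classes" of 8-dim embeddings clustered around different means.
train_x = [[random.gauss(c, 0.3) for _ in range(8)]
           for c in (0, 1) for _ in range(20)]
train_y = [c for c in (0, 1) for _ in range(20)]
test_x = [[random.gauss(c, 0.3) for _ in range(8)]
          for c in (0, 1) for _ in range(5)]
test_y = [c for c in (0, 1) for _ in range(5)]

acc = sum(knn_predict(train_x, train_y, q) == y
          for q, y in zip(test_x, test_y)) / len(test_y)
print(f"kNN probe accuracy: {acc:.2f}")
```

Because the probe has no trainable weights, its accuracy directly reflects how linearly separable and well-clustered the frozen embeddings already are, which is why the paper reports it alongside shallow MLP probes.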
Problem

Research questions and friction points this paper is trying to address.

universal audio representation
multi-task learning
audio encoding
cross-domain generalization
speech and environmental sound
Innovation

Methods, ideas, or system contributions that make the work stand out.

universal audio representation
continual multi-task training
instruction-based learning
next-token prediction
audio encoder
Authors
Yuxuan Chen (Jilin University)
Peize He (University of Electronic Science and Technology of China)
Haoyuan Xu (Hunan University)
Junzi Zhang (Citadel Securities; Amazon; Stanford University) — optimization, operations research, machine learning