🤖 AI Summary
Existing audio encoders are typically specialized for a single domain—such as speech, environmental sounds, or music—and lack general-purpose representational capabilities. This work proposes UniWhisper, the first framework to introduce instruction tuning into universal audio representation learning. By unifying heterogeneous audio tasks into an instruction–response format and employing a continual multi-task training strategy, UniWhisper enables cross-domain joint modeling within a single Whisper encoder, without requiring task-specific heads or loss functions. Experimental results demonstrate that UniWhisper significantly outperforms the original Whisper across 20 diverse tasks, achieving MLP probe and kNN normalized weighted average scores of 0.81 and 0.61, respectively, compared to Whisper’s 0.64 and 0.46. Moreover, UniWhisper maintains strong speech performance while offering efficient inference and robust generalization.
📝 Abstract
A universal audio representation should capture fine-grained speech cues as well as high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task training framework that casts heterogeneous audio tasks into a unified instruction–answer format. This enables standard next-token training without task-specific heads or losses. We train on 38k hours of public audio and assess the encoder using shallow MLP probes and k-nearest neighbors (kNN) on 20 tasks spanning speech, environmental sound, and music. UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper, while retaining strong speech performance.
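To make the unification concrete, here is a minimal sketch of how heterogeneous labeled audio tasks might be cast into a shared instruction–answer format suitable for a single next-token objective. The task names, instruction strings, and helper function are illustrative assumptions, not taken from the paper:

```python
# Hypothetical sketch: map examples from different audio tasks into one
# (audio, instruction, answer) schema so a single next-token loss over the
# answer text covers all tasks without task-specific heads or losses.
# All names here are illustrative, not from the UniWhisper paper.

ILLUSTRATIVE_INSTRUCTIONS = {
    "asr": "Transcribe the speech.",
    "sound": "Name the environmental sound event.",
    "music": "Identify the musical genre.",
}

def to_instruction_answer(task: str, audio_path: str, label: str) -> dict:
    """Cast one labeled example into the unified instruction-answer format."""
    return {
        "audio": audio_path,
        "instruction": ILLUSTRATIVE_INSTRUCTIONS[task],
        "answer": label,
    }

# Examples from three domains now share one schema and one training target.
batch = [
    to_instruction_answer("asr", "utt_001.wav", "hello world"),
    to_instruction_answer("sound", "clip_002.wav", "dog bark"),
    to_instruction_answer("music", "song_003.wav", "jazz"),
]
```

Because every task reduces to generating the answer tokens conditioned on the audio and instruction, the decoder's standard cross-entropy loss serves all tasks at once, which is the property the abstract attributes to the unified format.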