🤖 AI Summary
Existing audio encoders are typically specialized for a single domain—such as speech, environmental sounds, or music—and lack general-purpose representational capabilities. This work proposes UniWhisper, the first framework to introduce instruction tuning into universal audio representation learning. By unifying heterogeneous audio tasks into an instruction–response format and employing a continual multi-task training strategy, UniWhisper enables cross-domain joint modeling within a single Whisper encoder, without requiring task-specific heads or loss functions. Experimental results demonstrate that UniWhisper significantly outperforms the original Whisper across 20 diverse tasks, achieving MLP probe and kNN normalized weighted average scores of 0.81 and 0.61, respectively, compared to Whisper’s 0.64 and 0.46. Moreover, UniWhisper maintains strong speech performance while offering efficient inference and robust generalization.
📝 Abstract
A universal audio representation should capture fine-grained speech cues as well as high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task training framework that casts heterogeneous audio tasks into a unified instruction–answer format. This enables standard next-token training without task-specific heads or losses. We train on 38k hours of public audio and assess the encoder using shallow MLP probes and k-nearest neighbors (kNN) on 20 tasks spanning speech, environmental sound, and music. UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper, while retaining strong speech performance.
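To make the unification concrete, here is a minimal sketch of how heterogeneous labeled audio tasks might be cast into a shared instruction–answer format suitable for a single next-token objective. The task names, instruction strings, and helper function are illustrative assumptions, not taken from the paper:

```python
# Hypothetical sketch: map examples from different audio tasks into one
# (audio, instruction, answer) schema so a single next-token loss over the
# answer text covers all tasks without task-specific heads or losses.
# All names here are illustrative, not from the UniWhisper paper.

ILLUSTRATIVE_INSTRUCTIONS = {
    "asr": "Transcribe the speech.",
    "sound": "Name the environmental sound event.",
    "music": "Identify the musical genre.",
}

def to_instruction_answer(task: str, audio_path: str, label: str) -> dict:
    """Cast one labeled example into the unified instruction-answer format."""
    return {
        "audio": audio_path,
        "instruction": ILLUSTRATIVE_INSTRUCTIONS[task],
        "answer": label,
    }

# Examples from three domains now share one schema and one training target.
batch = [
    to_instruction_answer("asr", "utt_001.wav", "hello world"),
    to_instruction_answer("sound", "clip_002.wav", "dog bark"),
    to_instruction_answer("music", "song_003.wav", "jazz"),
]
```

Because every task reduces to generating the answer tokens conditioned on the audio and instruction, the decoder's standard cross-entropy loss serves all tasks at once, which is the property the abstract attributes to the unified format.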