🤖 AI Summary
In multi-label audio classification, global pooling (e.g., the CLS token) discards fine-grained, event-local information, causing linear probes to misrepresent embedding quality. This stems from a fundamental mismatch between global pretraining objectives and local downstream task requirements. To address this, we propose the binarized prototypical probe (BPP), which replaces global pooling with fine-grained, learnable class prototypes that aggregate local semantics while the self-supervised learning (SSL) model stays frozen. BPP is the first probe method whose performance matches full-model fine-tuning, establishing a new paradigm for audio SSL evaluation. Extensive experiments across 13 benchmark datasets and 6 spectrogram-based encoders demonstrate that BPP significantly outperforms both linear and attention-based probes in accuracy, computational efficiency, and cross-dataset generalization.
📝 Abstract
Although probing frozen models has become a standard evaluation paradigm, self-supervised learning in audio defaults to fine-tuning. A key reason is that global pooling creates an information bottleneck that causes linear probes to misrepresent embedding quality: the $\texttt{cls}$-token discards crucial token-level information about dispersed, localized events in multi-label audio. This weakness is rooted in the mismatch between the pretraining objective (which operates globally) and the downstream task (localized events). Across a comprehensive benchmark of 13 datasets and 6 spectrogram-based encoders, we first investigate the global pooling bottleneck. We then introduce binarized prototypical probes: a lightweight and simple pooling method that learns prototypes to perform class-wise information aggregation. Despite its simplicity, our method notably outperforms linear and attentive probing. Our work establishes probing as a competitive and efficient paradigm for evaluating audio SSL models, challenging the reliance on costly fine-tuning.
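To make the idea of class-wise aggregation concrete, here is a minimal sketch of a prototype-based probe over frozen token embeddings. It is an illustrative assumption, not the paper's exact formulation: each class gets one learnable prototype, token-prototype similarities define per-class attention weights over tokens, and the pooled per-class embedding is scored against its prototype to yield a binary logit. The function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prototypical_probe(tokens, prototypes):
    """Sketch of class-wise aggregation with learnable prototypes.

    tokens:     (T, D) token embeddings for one clip from a frozen SSL encoder
    prototypes: (C, D) one learnable prototype per class (random here; in
                practice trained with a binary loss while the encoder is frozen)
    Returns a (C,) vector of per-class binary logits.
    """
    sim = tokens @ prototypes.T              # (T, C) token-prototype similarity
    attn = softmax(sim, axis=0)              # per-class attention weights over tokens
    pooled = attn.T @ tokens                 # (C, D) class-wise pooled embeddings
    logits = (pooled * prototypes).sum(-1)   # (C,) score each pooled embedding
    return logits
```

Unlike a linear probe on a single global vector, each class here attends to the tokens most similar to its prototype, so dispersed, localized events can contribute to their own class score without being averaged away.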