🤖 AI Summary
In multi-label audio classification, global pooling (e.g., the CLS token) discards fine-grained, event-local information, causing linear probes to misrepresent embedding quality. This stems from a fundamental mismatch between global pretraining objectives and local downstream task requirements. To address this, we propose the binarized prototypical probe (BPP), which replaces global pooling with fine-grained, learnable class prototypes that aggregate local semantics while the self-supervised learning (SSL) model stays frozen. BPP is the first probe method whose performance matches full-model fine-tuning, establishing a new paradigm for audio SSL evaluation. Extensive experiments across 13 benchmark datasets and 6 spectrogram-based encoders demonstrate that BPP significantly outperforms both linear and attention-based probes in accuracy, computational efficiency, and cross-dataset generalization.
📝 Abstract
Although probing frozen models has become a standard evaluation paradigm, self-supervised learning in audio defaults to fine-tuning. A key reason is that global pooling creates an information bottleneck that causes linear probes to misrepresent embedding quality: the $\texttt{cls}$-token discards crucial token-level information about dispersed, localized events in multi-label audio. This weakness is rooted in the mismatch between the pretraining objective (which operates globally) and the downstream task (localized events). Across a comprehensive benchmark of 13 datasets and 6 spectrogram-based encoders, we first investigate the global pooling bottleneck. We then introduce binarized prototypical probes: a lightweight and simple pooling method that learns prototypes to perform class-wise information aggregation. Despite its simplicity, our method notably outperforms linear and attentive probing. Our work establishes probing as a competitive and efficient paradigm for evaluating audio SSL models, challenging the reliance on costly fine-tuning.
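To make the idea of class-wise aggregation concrete, here is a minimal sketch of a prototype-based probe over frozen token embeddings. It is an illustrative assumption, not the paper's exact formulation: each class gets one learnable prototype, token-prototype similarities define per-class attention weights over tokens, and the pooled per-class embedding is scored against its prototype to yield a binary logit. The function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prototypical_probe(tokens, prototypes):
    """Sketch of class-wise aggregation with learnable prototypes.

    tokens:     (T, D) token embeddings for one clip from a frozen SSL encoder
    prototypes: (C, D) one learnable prototype per class (random here; in
                practice trained with a binary loss while the encoder is frozen)
    Returns a (C,) vector of per-class binary logits.
    """
    sim = tokens @ prototypes.T              # (T, C) token-prototype similarity
    attn = softmax(sim, axis=0)              # per-class attention weights over tokens
    pooled = attn.T @ tokens                 # (C, D) class-wise pooled embeddings
    logits = (pooled * prototypes).sum(-1)   # (C,) score each pooled embedding
    return logits
```

Unlike a linear probe on a single global vector, each class here attends to the tokens most similar to its prototype, so dispersed, localized events can contribute to their own class score without being averaged away.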