🤖 AI Summary
Existing few-shot audio classification evaluations overlook models’ reliance on spurious contextual cues such as background environments, and so fail to expose fragile generalization. This work introduces SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to establish an evaluation protocol with controllable, multi-level context shifts between support and query sets. Using this benchmark, the authors show that many state-of-the-art few-shot methods suffer severe performance degradation when background correlations are disrupted, and that this vulnerability persists even in large pretrained audio foundation models. Methods that score similarly under standard protocols can also differ markedly in their sensitivity to spurious correlations, a difference tied to how feature representations interact with classifier heads at inference time. The study concludes that conventional evaluation protocols have blind spots and that benchmarks explicitly probing context dependence are needed.
📝 Abstract
Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues. In real-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals. While such effects have been studied in few-shot image classification, their role in few-shot audio classification remains largely unexplored, and existing audio benchmarks offer limited control over contextual structure. We introduce SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to enable controlled, multi-level evaluation of contextual shifts across support and query sets. Using this benchmark, we show that many state-of-the-art few-shot methods suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols. Crucially, this vulnerability persists even in large pretrained audio foundation models, ruling out limited backbone capacity as an explanation. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations, revealing systematic algorithmic strengths and vulnerabilities tied to how feature representations interact with classifier heads at inference time. These findings provide new insight into the behavior of few-shot methods in audio and highlight the need for benchmarks that explicitly probe context dependence when evaluating FSC models.
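To make the idea of a controlled context shift between support and query sets concrete, the following is a minimal, hypothetical sketch of how one episode could be planned. It is not SpurAudio's actual protocol: the function name `plan_episode`, the `shift` parameter, the one-background-per-class assumption, and the example class and background names are all illustrative assumptions.

```python
import random


def plan_episode(classes, backgrounds, n_shot, n_query, shift, seed=0):
    """Assign a background to every support and query item of one N-way episode.

    Hypothetical illustration only (not the benchmark's real procedure).
    shift=0.0: every query keeps the background its class had in the support set.
    shift=1.0: every query gets a background that another class had in the support set.
    Values in between give the per-query probability of swapping the background.
    Requires at least two classes and at least as many backgrounds as classes.
    """
    rng = random.Random(seed)
    # Assumption: each class is paired with one distinct background in the support set.
    class_bg = dict(zip(classes, rng.sample(backgrounds, len(classes))))
    support = [(c, class_bg[c]) for c in classes for _ in range(n_shot)]

    query = []
    for c in classes:
        # Backgrounds that the support set associates with the other classes.
        other_bgs = [class_bg[o] for o in classes if o != c]
        for _ in range(n_query):
            swap = rng.random() < shift
            query.append((c, rng.choice(other_bgs) if swap else class_bg[c]))
    return support, query


# Example: a 3-way, 5-shot episode with 15 queries per class.
classes = ["dog_bark", "siren", "glass_break"]
backgrounds = ["street", "park", "cafe", "rain"]
aligned_support, aligned_query = plan_episode(classes, backgrounds, 5, 15, shift=0.0)
shifted_support, shifted_query = plan_episode(classes, backgrounds, 5, 15, shift=1.0)
```

Under this sketch, a method that classifies by background rather than by foreground event would look accurate on the aligned episode and fail on the shifted one, which is the kind of gap a context-aware benchmark is meant to expose.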