🤖 AI Summary
This work addresses the high sensitivity of audio-language models to textual prompt phrasing in zero-shot classification, where minor wording variations can cause substantial performance fluctuations. To mitigate this, the authors propose an unsupervised prompt weighting method based on prediction entropy, introducing entropy as a proxy for prompt quality for the first time. By formulating an objective function that minimizes prediction entropy, the method adjusts prompt weights to increase prediction confidence. The approach requires no additional annotations, incurs low computational overhead, and supports both batch-wise and sample-wise adaptation. Evaluated on five audio datasets spanning environmental, urban, and human sounds, it yields accuracy gains up to five times larger than those of conventional prompt ensembling baselines across the whole benchmark.
📝 Abstract
Audio-language models have recently demonstrated strong zero-shot capabilities by leveraging natural-language supervision to classify audio events without labeled training data. Yet their performance is highly sensitive to the wording of text prompts, with small variations leading to large fluctuations in accuracy. Prior work has mitigated this issue through prompt learning or prompt ensembling. However, these strategies either require annotated data or fail to account for the fact that some prompts may negatively impact performance. In this work, we present an entropy-guided prompt weighting approach that seeks a robust combination of prompt contributions to maximize prediction confidence. To this end, we formulate a tailored objective function that minimizes prediction entropy to yield new prompt weights, using low entropy as a proxy for high confidence. Our approach can be applied to individual samples or to a batch of audio samples, requires no additional labels, and incurs negligible computational overhead. Experiments on five audio classification datasets covering environmental, urban, and vocal sounds demonstrate consistent gains over classical prompt ensembling in the zero-shot setting, with accuracy improvements up to five times larger across the whole benchmark.
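To make the core idea concrete, here is a minimal sketch of entropy-guided prompt weighting for a single sample. All names and numbers are illustrative (the paper's actual optimizer, parameterization, and similarity model are not specified here): random values stand in for audio-text similarity logits, prompt weights are kept on the simplex via a softmax over free parameters, and the Shannon entropy of the weighted class posterior is minimized.

```python
# Hypothetical sketch of entropy-guided prompt weighting; not the authors' code.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Stand-in for audio-text similarity logits: one row per prompt template,
# one column per class, for a single audio sample (or a batch average).
num_prompts, num_classes = 4, 10
logits = rng.normal(size=(num_prompts, num_classes))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def entropy_objective(theta):
    # Prompt weights constrained to the simplex via a softmax over free params.
    w = softmax(theta)
    combined = w @ logits                   # weighted combination of prompt logits
    p = softmax(combined)                   # class posterior for the sample
    return -(p * np.log(p + 1e-12)).sum()   # Shannon entropy of the prediction

theta0 = np.zeros(num_prompts)  # uniform weights = classical prompt ensembling
res = minimize(entropy_objective, theta0, method="L-BFGS-B")
weights = softmax(res.x)
```

Minimizing entropy pushes weight toward the prompts whose combination yields the most confident prediction; starting from uniform weights, the optimized entropy is never worse than that of the plain prompt ensemble. For the batch-wise variant, one would average the entropy objective over a batch of samples instead of evaluating it on a single one.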