Leveraging Prediction Entropy for Automatic Prompt Weighting in Zero-Shot Audio-Language Classification

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high sensitivity of audio-language models to textual prompt phrasing in zero-shot classification, where minor wording variations can cause substantial performance fluctuations. To mitigate this, the authors propose an unsupervised prompt weighting method that uses prediction entropy as a proxy for prompt quality. By minimizing an entropy-based objective, the method adjusts the prompt weights to increase prediction confidence. The approach requires no additional annotations, incurs negligible computational overhead, and supports both batch-wise and sample-wise adaptation. Evaluated on five audio datasets spanning environmental, urban, and human sounds, it yields accuracy gains roughly five times larger than those of conventional prompt ensembling baselines across the whole benchmark.

📝 Abstract
Audio-language models have recently demonstrated strong zero-shot capabilities by leveraging natural-language supervision to classify audio events without labeled training data. Yet, their performance is highly sensitive to the wording of text prompts, with small variations leading to large fluctuations in accuracy. Prior work has mitigated this issue through prompt learning or prompt ensembling. However, these strategies either require annotated data or fail to account for the fact that some prompts may negatively impact performance. In this work, we present an entropy-guided prompt weighting approach that aims to find a robust combination of prompt contributions to maximize prediction confidence. To this end, we formulate a tailored objective function that minimizes prediction entropy to yield new prompt weights, utilizing low entropy as a proxy for high confidence. Our approach can be applied to individual samples or a batch of audio samples, requiring no additional labels and incurring negligible computational overhead. Experiments on five audio classification datasets covering environmental, urban, and vocal sounds demonstrate consistent gains compared to classical prompt ensembling methods in a zero-shot setting, with accuracy improvements five times larger across the whole benchmark.
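The core idea in the abstract can be sketched numerically: treat the softmax of learnable weight logits as convex prompt weights, combine the per-prompt class scores under those weights, and minimize the entropy of the resulting class distribution. The sketch below is illustrative only, not the authors' implementation; all function names, the optimizer choice (SciPy's L-BFGS-B), and the toy logits are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def entropy_weighted_probs(weight_logits, prompt_logits):
    """Class distribution from entropy-weighted prompt scores.

    prompt_logits: (P, C) array, one row of class logits per prompt
    (shown here for a single audio sample; a batch-wise variant would
    average the entropy over samples).
    """
    w = np.exp(weight_logits - weight_logits.max())
    w /= w.sum()                     # softmax -> convex prompt weights
    combined = w @ prompt_logits     # weighted combination of prompt scores
    p = np.exp(combined - combined.max())
    return p / p.sum()               # softmax over classes

def prediction_entropy(weight_logits, prompt_logits):
    p = entropy_weighted_probs(weight_logits, prompt_logits)
    return -(p * np.log(p + 1e-12)).sum()

def entropy_guided_weights(prompt_logits):
    """Minimize prediction entropy over the prompt-weight logits."""
    P = prompt_logits.shape[0]
    res = minimize(prediction_entropy, np.zeros(P),  # start from uniform weights
                   args=(prompt_logits,), method="L-BFGS-B")
    w = np.exp(res.x - res.x.max())
    return w / w.sum()

# Toy example: 3 prompts, 4 classes; the third prompt is noisy.
rng = np.random.default_rng(0)
logits = np.stack([np.array([4.0, 0.0, 0.0, 0.0]),
                   np.array([3.0, 0.5, 0.0, 0.0]),
                   rng.normal(size=4)])
w = entropy_guided_weights(logits)
```

Because the optimizer starts from uniform weights (the classical prompt-ensembling average), the learned weights can only match or reduce the entropy of that baseline, which is the sense in which low entropy stands in for high confidence here.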
Problem

Research questions and friction points this paper is trying to address.

zero-shot audio classification
prompt sensitivity
audio-language models
prediction entropy
prompt weighting
Innovation

Methods, ideas, or system contributions that make the work stand out.

prediction entropy
prompt weighting
zero-shot audio-language classification
entropy minimization
prompt ensembling
👥 Authors

K. E. Khoury — ICTEAM, UCLouvain, Belgium
Maxime Zanella — PhD student, Catholic University of Louvain (computer vision, machine learning, deep learning)
Tiffanie Godelaine — ICTEAM, UCLouvain, Belgium
C. Vleeschouwer — ICTEAM, UCLouvain, Belgium
Benoit M. Macq — ICTEAM, UCLouvain, Belgium