Leveraging Prediction Entropy for Automatic Prompt Weighting in Zero-Shot Audio-Language Classification
By: Karim El Khoury, Maxime Zanella, Tiffanie Godelaine, and more
Potential Business Impact:
Makes AI better at understanding sounds with words.
Audio-language models have recently demonstrated strong zero-shot capabilities by leveraging natural-language supervision to classify audio events without labeled training data. Yet their performance is highly sensitive to the wording of text prompts, with small variations leading to large fluctuations in accuracy. Prior work has mitigated this issue through prompt learning or prompt ensembling, but these strategies either require annotated data or fail to account for the fact that some prompts may negatively impact performance. In this work, we present an entropy-guided prompt weighting approach that seeks a robust combination of prompt contributions to maximize prediction confidence. To this end, we formulate a tailored objective function that minimizes prediction entropy to yield new prompt weights, using low entropy as a proxy for high confidence. Our approach can be applied to individual samples or to a batch of audio samples, requires no additional labels, and incurs negligible computational overhead. Experiments on five audio classification datasets covering environmental, urban, and vocal sounds demonstrate consistent gains over classical prompt ensembling in the zero-shot setting, with accuracy improvements five times larger across the whole benchmark.
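The abstract does not spell out the optimization, but the idea admits a compact sketch: given per-prompt audio-text similarity logits, learn simplex-constrained prompt weights by gradient descent on the mean prediction entropy of the weighted ensemble. The sketch below is a minimal PyTorch illustration under assumed shapes and an assumed optimizer setup; the function name, step count, and learning rate are illustrative, not the authors' exact formulation.

```python
import torch

def entropy_weighted_prompts(logits, steps=100, lr=0.1):
    """Find prompt weights that minimize mean prediction entropy.

    logits: tensor of shape (batch, num_prompts, num_classes) holding
    audio-text similarity scores, one slice per prompt template.
    Returns (weights, combined_logits). Hypothetical sketch, not the
    paper's exact objective.
    """
    batch, num_prompts, num_classes = logits.shape
    # Unconstrained parameters; a softmax keeps the weights on the simplex.
    theta = torch.zeros(num_prompts, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        w = torch.softmax(theta, dim=0)                  # (P,)
        combined = torch.einsum("p,bpc->bc", w, logits)  # (B, C)
        probs = torch.softmax(combined, dim=-1)
        # Mean Shannon entropy over the batch; low entropy = high confidence.
        entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()
    with torch.no_grad():
        w = torch.softmax(theta, dim=0)
        combined = torch.einsum("p,bpc->bc", w, logits)
    return w, combined
```

In a CLAP-style pipeline, logits[n, p, c] would be the cosine similarity between the n-th audio embedding and the text embedding of prompt template p filled with the name of class c; the routine then replaces uniform prompt averaging with the learned weights, and works on a single sample (batch of one) or a whole batch.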
Similar Papers
Which Words Matter Most in Zero-Shot Prompts?
Computation and Language
Finds which words make AI understand instructions best.
Prompt-Aware Classifier-Free Guidance for Diffusion Models
Sound
Makes AI images and sounds better by guessing the best settings.
Improving Audio Classification by Transitioning from Zero- to Few-Shot
Sound
Helps computers better guess sounds using fewer examples.