The Art of Saying "Maybe": A Conformal Lens for Uncertainty Benchmarking in VLMs
By: Asif Azad, Mohammad Sadat Hossain, MD Sadik Hossain Shanto, and more
Potential Business Impact:
Helps AI know when it's unsure about answers.
Vision-Language Models (VLMs) have achieved remarkable progress in complex visual understanding across scientific and reasoning tasks. While performance benchmarking has advanced our understanding of these capabilities, the critical dimension of uncertainty quantification has received insufficient attention. To address this gap, and unlike prior conformal prediction studies confined to limited settings, we conduct a comprehensive uncertainty benchmarking study, evaluating 16 state-of-the-art VLMs (open- and closed-source) across 6 multimodal datasets with 3 distinct scoring functions. Our findings demonstrate that larger models consistently exhibit better uncertainty quantification: models that know more also know better what they don't know. More certain models achieve higher accuracy, while mathematical and reasoning tasks elicit poorer uncertainty performance across all models compared to other domains. This work establishes a foundation for reliable uncertainty evaluation in multimodal systems.
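To make the conformal prediction framing concrete, here is a minimal sketch of split conformal prediction for a multiple-choice setting like the one the abstract describes. The paper's specific scoring functions, datasets, and models are not detailed here, so this example assumes the standard "1 minus the probability of the true answer" (LAC) score and uses hypothetical variable names and toy data purely for illustration; it is not the authors' implementation.

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction with an assumed 1 - p(true class) score (LAC).

    cal_probs:  (n_cal, n_classes) softmax probabilities on a held-out calibration set
    cal_labels: (n_cal,) integer ground-truth answer indices
    test_probs: (n_test, n_classes) softmax probabilities on test inputs
    alpha:      target miscoverage rate (e.g. 0.1 -> roughly 90% coverage)
    """
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability assigned to the true answer.
    cal_scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Conformal quantile with the finite-sample correction.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(cal_scores, q_level, method="higher")
    # Prediction set: every answer whose probability clears the calibrated threshold.
    return test_probs >= 1.0 - qhat  # boolean mask of shape (n_test, n_classes)

# Toy usage with random "answer-choice" probabilities, for illustration only.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(4), size=500)
cal_labels = rng.integers(0, 4, size=500)
test_probs = rng.dirichlet(np.ones(4), size=10)
sets = conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1)
print(sets.sum(axis=1))  # size of each prediction set; larger sets signal more uncertainty
```

The average size of these prediction sets is one common way to compare uncertainty quantification across models: a model whose calibrated sets stay small while maintaining coverage is "saying maybe" less often.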
Similar Papers
Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models
CV and Pattern Recognition
Helps AI know when it's wrong.
Zero-shot image privacy classification with Vision-Language Models
CV and Pattern Recognition
Helps computers guess which pictures are private.
Exploiting the Asymmetric Uncertainty Structure of Pre-trained VLMs on the Unit Hypersphere
Machine Learning (CS)
Helps computers understand pictures and words better.