Unexplored flaws in multiple-choice VQA evaluations
By: Fabio Rosenthal, Sebastian Schmidt, Thorsten Graf, and more
Potential Business Impact:
Shows that AI answers can change just by rewording the question.
Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in handling image-text inputs. A common way to assess this ability is through multiple-choice Visual Question Answering (VQA). Earlier works have already revealed that these benchmarks are sensitive to answer choice order, a limitation that can be mitigated through careful design. Yet, we highlight additional, unexplored biases in prompt formatting that call into question the reliability of current MLLM evaluations. Specifically, we identify three key variation factors in prompt formatting and analyze their impact through a large-scale study involving seven MLLMs and five VQA datasets, spanning 48 distinct prompt format variations. Our findings reveal that multiple-choice VQA is highly sensitive to minor prompt format changes, even when these changes are semantically neutral. We further demonstrate that these biases persist independently of both known order biases and the MLLM's confidence in the correct answer. Finally, we show that existing bias mitigation strategies fail to address these newly identified biases.
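To make the idea of "semantically neutral prompt format variations" concrete, here is a minimal Python sketch of how such variants could be enumerated and how a model's sensitivity to them might be scored. The specific format factors shown (option labels, separators, option-list header) and the helper names (`build_prompts`, `format_sensitivity`, `ask_mllm`) are illustrative assumptions, not the paper's actual factors or code.

```python
from itertools import product

# Hypothetical prompt-format factors. The paper identifies three such factors,
# but the abstract does not name them, so these are illustrative stand-ins.
OPTION_LABELS = [("A", "B", "C", "D"), ("1", "2", "3", "4")]
SEPARATORS = [". ", ") "]
OPTION_HEADERS = ["Options:\n", "Choices:\n"]

def build_prompts(question: str, choices: list[str]) -> list[str]:
    """Enumerate semantically neutral prompt-format variants of one VQA item."""
    prompts = []
    for labels, sep, header in product(OPTION_LABELS, SEPARATORS, OPTION_HEADERS):
        options = "\n".join(f"{lab}{sep}{c}" for lab, c in zip(labels, choices))
        prompts.append(f"{question}\n{header}{options}\nAnswer with the option label.")
    return prompts

def format_sensitivity(answers: list[str]) -> float:
    """Fraction of prompt variants whose answer differs from the majority answer."""
    majority = max(set(answers), key=answers.count)
    return sum(a != majority for a in answers) / len(answers)

# Usage sketch: `ask_mllm` stands in for any image+text MLLM call (assumed, not real API).
# prompts = build_prompts("What color is the car?", ["red", "blue", "green", "black"])
# answers = [ask_mllm(image, p) for p in prompts]
# print(format_sensitivity(answers))  # 0.0 means the answer is stable across formats
```

A sensitivity score above zero on many items would indicate the kind of format bias the paper reports, since none of the variants change the question's meaning.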
Similar Papers
Evaluating Variance in Visual Question Answering Benchmarks
CV and Pattern Recognition
Makes AI answers more trustworthy and consistent.
Benchmarking and Mitigating MCQA Selection Bias of Large Vision-Language Models
CV and Pattern Recognition
Reduces AI's habit of favoring certain answer choices regardless of content.
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
Computation and Language
Makes AI understand questions better, not just guess.