Visually Prompted Benchmarks Are Surprisingly Fragile
By: Haiwen Feng, Long Lian, Lisa Dunlap, and more
A key challenge in evaluating VLMs is testing models' ability to analyze visual content independently from their textual priors. Recent benchmarks such as BLINK probe visual perception through visual prompting, where questions about visual content are paired with image coordinates to which the question refers, and those coordinates are explicitly marked in the image itself. While these benchmarks are an important part of VLM evaluation, we find that existing models are surprisingly fragile to seemingly irrelevant details of visual prompting: simply changing a visual marker from red to blue can completely change rankings among models on a leaderboard. By evaluating nine commonly used open- and closed-source VLMs on two visually prompted tasks, we demonstrate how details of the benchmark setup, including visual marker design and dataset size, have a significant influence on model performance and leaderboard rankings. These effects can even be exploited to lift weaker models above stronger ones; for instance, slightly increasing the size of the visual marker results in open-source InternVL3-8B ranking on par with or above much larger proprietary models like Gemini 2.5 Pro. We further show that low-level inference choices that are often ignored in benchmarking, such as JPEG compression levels in API calls, can also cause changes in model rankings. These details have substantially larger impacts on visually prompted benchmarks than on conventional semantic VLM evaluations. To mitigate this instability, we curate existing datasets to create VPBench, a larger visually prompted benchmark with 16 visual marker variants. VPBench and additional analysis tools are released at https://lisadunlap.github.io/vpbench/.
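To make the setup concrete, here is a minimal sketch, not taken from the paper, of how a visually prompted example might be rendered with different marker variants and JPEG compression levels. The function name, the marker parameters, and the file paths are hypothetical; the point is only to show the kind of "seemingly irrelevant" rendering knobs (marker color, marker size, encoding quality) whose variation the abstract reports as destabilizing rankings.

```python
# Hypothetical illustration (not the paper's code): overlay a visual-prompt
# marker on an image and re-encode it at a chosen JPEG quality, producing
# the kinds of rendering variants the abstract describes.
from PIL import Image, ImageDraw


def render_visual_prompt(image_path, point, color="red", radius=12,
                         jpeg_quality=95, out_path="prompted.jpg"):
    """Draw a circular marker at `point` (x, y) and save the result as JPEG."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    x, y = point
    # Circle outline marking the queried location; its color and radius are
    # exactly the kind of marker-design detail the paper varies.
    draw.ellipse([x - radius, y - radius, x + radius, y + radius],
                 outline=color, width=3)
    img.save(out_path, format="JPEG", quality=jpeg_quality)
    return out_path


# Example: the same question image rendered under two marker colors and two
# compression levels, yielding four evaluation conditions for the same query.
for color in ("red", "blue"):
    for quality in (95, 75):
        render_visual_prompt("scene.jpg", point=(320, 240), color=color,
                             jpeg_quality=quality,
                             out_path=f"scene_{color}_q{quality}.jpg")
```

In a benchmarking harness, each such variant would be sent to every model with the same question text, so that any score differences are attributable to the rendering details alone.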
Similar Papers
V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
CV and Pattern Recognition
Benchmarks video-language understanding when models are queried with visual prompts.
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
CV and Pattern Recognition
Benchmarks how accurately vision-language models read measurement instruments.
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
CV and Pattern Recognition
Evaluates how robust large vision-language models are to misleading visual inputs.