ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models
By: Liyan Tang, Grace Kim, Xinyu Zhao, and more
Potential Business Impact:
Helps computers understand charts better.
Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources and specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks, where frontier models perform similarly and near saturation, our benchmark exposes a substantial gap between model and human performance while effectively differentiating model capabilities: humans achieve 93% accuracy, yet the best-performing model, Gemini-2.5-Pro, attains only 63.0%, and the leading open-source LVLM, Qwen2.5-VL-72B-Instruct, achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models suffer a 35%-55% drop in accuracy relative to their performance on questions that rely mainly on textual reasoning. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.
Similar Papers
EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts
CV and Pattern Recognition
Helps computers better understand charts and graphs.
InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information
Computation and Language
Helps computers understand many charts together.
Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts
Computation and Language
Makes AI understand charts by asking "what if".