MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models
By: Yang Shi, Yifeng Xie, Minzhe Guo, and more
Potential Business Impact:
Helps computers spot mistakes in their thinking.
Recent advances in Vision-Language Models (VLMs) have improved performance in multi-modal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multi-modal benchmark of 2,013 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 20 advanced VLMs; even the best model (Gemini-3.0-Pro) correctly classifies the error in only 66.47% of cases, underscoring the challenge of identifying erroneous reasoning. Furthermore, the ability to accurately identify errors offers valuable insight into the capabilities of multi-modal reasoning models. Project Page: https://mmerror-benchmark.github.io
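The abstract frames the task as step-level error detection plus error-type classification over reasoning chains grounded in an image. As a rough illustration only, the sketch below shows how such an evaluation could be wired up; the sample fields (image_path, reasoning_steps, error_step, error_type) and the predict interface are assumptions for this sketch, not MMErroR's actual data format or scoring script.

```python
from dataclasses import dataclass

# Hypothetical sketch of a process-level, error-centric evaluation in the
# spirit of MMErroR. Field names and metrics are illustrative assumptions.

@dataclass
class MMErrorSample:
    image_path: str             # visual context for the reasoning chain
    reasoning_steps: list[str]  # chain of steps; exactly one step is wrong
    error_step: int             # index of the erroneous step (ground truth)
    error_type: str             # label from the benchmark's error taxonomy

def evaluate(samples: list[MMErrorSample], predict) -> tuple[float, float]:
    """predict(sample) -> (predicted_step, predicted_type).

    Returns (detection accuracy, classification accuracy): a sample counts
    as classified only if both the erroneous step and its type are correct.
    """
    detected = classified = 0
    for s in samples:
        step, etype = predict(s)
        if step == s.error_step:
            detected += 1
            if etype == s.error_type:
                classified += 1
    n = len(samples)
    return detected / n, classified / n
```

Under this reading, the reported 66.47% would correspond to the classification accuracy of the strongest model over the 2,013 samples.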
Similar Papers
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
Computation and Language
Teaches computers to spot fake or wrong pictures.
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos
CV and Pattern Recognition
Teaches computers to understand video stories better.
VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models
CV and Pattern Recognition
Tests AI's ability to explain and fix its mistakes.