EasyARC: Evaluating Vision Language Models on True Visual Reasoning
By: Mert Unsal, Aylin Akkus
Potential Business Impact:
Teaches computers to understand pictures and words together.
Building on recent advances in language-based reasoning models, we explore multimodal reasoning that integrates vision and text. Existing multimodal benchmarks primarily test visual extraction combined with text-based reasoning, and lack true visual reasoning involving more complex interactions between vision and language. Inspired by the ARC challenge, we introduce EasyARC, a vision-language benchmark requiring multi-image, multi-step reasoning and self-correction. EasyARC is procedurally generated, fully verifiable, and scalable, making it well suited for reinforcement learning (RL) pipelines. Its generators incorporate progressive difficulty levels, enabling structured evaluation across task types and complexities. We benchmark state-of-the-art vision-language models and analyze their failure modes. We argue that EasyARC sets a new standard for evaluating true reasoning and test-time scaling capabilities in vision-language models. We open-source our benchmark dataset and evaluation code.
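The abstract describes EasyARC as procedurally generated, fully verifiable, and organized into progressive difficulty levels, which is what makes it usable as an RL reward signal. The sketch below illustrates what such a generator-plus-verifier pairing could look like; it is a hypothetical illustration, not the released EasyARC code. The task rule (recolor all nonzero cells), the function names, and the difficulty-to-size mapping are all assumptions for demonstration purposes.

```python
# Minimal sketch (not the actual EasyARC API) of a procedurally generated,
# exactly verifiable ARC-style task with a difficulty parameter.
import random


def generate_color_fill_task(difficulty: int, seed: int | None = None):
    """Return (examples, test_input, expected_output) for one synthetic task.

    Difficulty controls the grid size and the number of demonstration pairs,
    mirroring the idea of progressive difficulty levels. The hidden rule here
    is a stand-in: recolor every nonzero cell to a single target color.
    """
    rng = random.Random(seed)
    size = 3 + difficulty        # larger grids at higher difficulty
    n_examples = 2 + difficulty  # more demonstration pairs at higher difficulty
    color = rng.randint(1, 9)    # the hidden rule's target color

    def make_pair():
        grid = [
            [rng.choice([0, rng.randint(1, 9)]) for _ in range(size)]
            for _ in range(size)
        ]
        target = [[color if cell != 0 else 0 for cell in row] for row in grid]
        return grid, target

    examples = [make_pair() for _ in range(n_examples)]
    test_input, expected_output = make_pair()
    return examples, test_input, expected_output


def verify(prediction, expected) -> bool:
    """Exact-match check: each task has a single programmatically verifiable
    answer, so model outputs can be scored automatically (e.g. as an RL reward)."""
    return prediction == expected


if __name__ == "__main__":
    examples, test_input, expected = generate_color_fill_task(difficulty=1, seed=0)
    print(verify(expected, expected))  # True: the generated answer verifies itself
```

Because every task is generated from a seed and checked by exact match, the same machinery can serve both as a scalable evaluation harness and as a verifiable reward function in an RL pipeline, as the abstract suggests.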
Similar Papers
ARC Is a Vision Problem!
CV and Pattern Recognition
Helps computers solve visual puzzles like humans.
Think Visually, Reason Textually: Vision-Language Synergy in ARC
CV and Pattern Recognition
Teaches computers to learn like humans do.
Do AI Models Perform Human-like Abstract Reasoning Across Modalities?
Artificial Intelligence
AI struggles to truly understand and reason like humans.