KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old?
By: Xianfeng Wang, Kaiwei Zhang, Qi Jia, and more
Potential Business Impact:
Computers can't see like young kids yet.
While Multimodal Large Language Models (MLLMs) have demonstrated impressive proficiency in high-level reasoning tasks such as complex diagrammatic interpretation, it remains an open question whether they possess fundamental visual primitives comparable to human intuition. To investigate this, we introduce KidVis, a novel benchmark grounded in the theory of human visual development. KidVis deconstructs visual intelligence into six atomic capabilities already possessed by 6- to 7-year-old children (Concentration, Tracking, Discrimination, Memory, Spatial, and Closure) and comprises 10 categories of low-semantic-dependent visual tasks. Evaluating 20 state-of-the-art MLLMs against a human physiological baseline reveals a stark performance disparity: human children achieve a near-perfect average score of 95.32, while the state-of-the-art GPT-5 attains only 67.33. Crucially, we observe a "Scaling Law Paradox": simply increasing model parameters fails to yield linear improvements in these foundational visual capabilities. This study confirms that current MLLMs, despite their reasoning prowess, lack the essential physiological perceptual primitives required for generalized visual intelligence.
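To make the evaluation setup described above concrete, here is a minimal Python sketch of how per-capability scores could be averaged into a single 0-100 figure and compared against a human baseline. The six capability names come from the abstract; the file layout, field names, exact-match scoring, and function names are assumptions for illustration, not the authors' actual KidVis protocol or code.

# Hypothetical sketch of a KidVis-style scoring pass. Capability names are from
# the abstract; the JSONL record format and exact-match scoring are assumptions.
import json
from collections import defaultdict

CAPABILITIES = ["Concentration", "Tracking", "Discrimination",
                "Memory", "Spatial", "Closure"]

def score_model(predictions_path: str) -> dict:
    """Compute per-capability accuracy (0-100) from a JSONL file whose records
    look like {"capability": ..., "answer": ..., "prediction": ...}."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(predictions_path) as f:
        for line in f:
            rec = json.loads(line)
            cap = rec["capability"]
            total[cap] += 1
            if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
                correct[cap] += 1
    return {cap: 100.0 * correct[cap] / total[cap]
            for cap in CAPABILITIES if total[cap]}

def average_score(per_capability: dict) -> float:
    """Unweighted mean over capabilities, comparable in spirit to the single
    averages reported in the abstract (e.g., 95.32 for children)."""
    return sum(per_capability.values()) / len(per_capability)

Under these assumptions, running score_model on a model's prediction file and then average_score on the result would yield one number per model, which is how a gap such as 95.32 (children) versus 67.33 (GPT-5) could be tabulated across the 20 evaluated MLLMs.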
Similar Papers
BabyVision: Visual Reasoning Beyond Language
CV and Pattern Recognition
Teaches computers to see like toddlers.
Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs
CV and Pattern Recognition
Helps computers understand pictures like people do.
Can Multimodal LLMs Solve the Basic Perception Problems of Percept-V?
Computation and Language
Tests if AI can see simple shapes and patterns.