SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking
By: Sifan Li, Yujun Cai, Yiwei Wang
Potential Business Impact:
Makes computers see hidden things in pictures.
Vision-language models (VLMs) excel at semantic tasks but falter at a core human capability: detecting hidden content in optical illusions or AI-generated images through perceptual adjustments such as zooming. We introduce HC-Bench, a benchmark of 112 images with hidden text, objects, and illusions, revealing that leading VLMs achieve near-zero accuracy (0–5.36%), even with explicit prompting. Humans resolve such ambiguities instinctively, yet VLMs fail due to an overreliance on high-level semantics. Strikingly, simply scaling images to low resolutions (32–128 pixels), a method we call SemVink (Semantic Visual Thinking), unlocks >99% accuracy by eliminating redundant visual noise. This exposes a critical architectural flaw: VLMs prioritize abstract reasoning over the low-level visual operations crucial for real-world robustness. Our work urges a shift toward hybrid models that integrate multi-scale processing, bridging the gap between computational vision and human cognition for applications in medical imaging, security, and beyond.
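The abstract's core recipe, downscaling an image to 32–128 pixels before passing it to a VLM, can be sketched as plain block averaging. This is a minimal, dependency-free illustration; the paper specifies the 32–128 pixel range but not a resampling method, so the averaging filter and the function name `semvink_downscale` are assumptions here.

```python
def semvink_downscale(pixels, target=64):
    """Downscale a 2D grayscale image (list of rows of intensity values)
    so its longer side is roughly `target` pixels, by averaging blocks.
    Illustrates the low-resolution preprocessing idea from SemVink;
    the averaging filter itself is an assumption, not the paper's spec."""
    h, w = len(pixels), len(pixels[0])
    scale = max(w, h) / target
    if scale <= 1:  # already at or below the target resolution
        return [row[:] for row in pixels]
    out_h = max(1, round(h / scale))
    out_w = max(1, round(w / scale))
    out = []
    for oy in range(out_h):
        # Source-row span covered by this output row.
        y0, y1 = oy * h // out_h, (oy + 1) * h // out_h
        row = []
        for ox in range(out_w):
            # Source-column span covered by this output pixel.
            x0, x1 = ox * w // out_w, (ox + 1) * w // out_w
            block = [pixels[y][x] for y in range(y0, y1)
                                  for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

if __name__ == "__main__":
    # Hypothetical usage: shrink a 256x192 image to a 64-pixel long side
    # before handing it to a VLM for hidden-content detection.
    img = [[(x + y) % 256 for x in range(256)] for y in range(192)]
    small = semvink_downscale(img, target=64)
    print(len(small[0]), len(small))  # 64 48
```

In practice one would use a proper resampling filter (e.g. Pillow's `Image.resize` with `LANCZOS`) rather than naive averaging, but the effect, discarding high-frequency detail so that the hidden low-frequency structure dominates, is the same.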
Similar Papers
Vision language models are unreliable at trivial spatial cognition
CV and Pattern Recognition
Computers struggle to tell what's left or right.
Vision language models have difficulty recognizing virtual objects
CV and Pattern Recognition
AI struggles to imagine unseen objects in pictures.
Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection
CV and Pattern Recognition
Helps computers understand pictures better by focusing on important parts.