Can Modern Vision Models Understand the Difference Between an Object and a Look-alike?
By: Itay Cohen, Ethan Fetaya, Amir Rosenfeld
Potential Business Impact:
Helps computers tell real things from look-alikes.
Recent advances in computer vision have yielded models with strong performance on recognition benchmarks; however, significant gaps remain relative to human perception. One such subtle ability is judging whether an image merely looks like a given object without being an instance of it. We study whether vision-language models such as CLIP capture this distinction. We curate RoLA (Real or Lookalike), a dataset of real and lookalike exemplars (e.g., toys, statues, drawings, pareidolia) across multiple categories, and first evaluate a prompt-based baseline with paired "real"/"lookalike" prompts. We then estimate a direction in CLIP's embedding space that moves representations between real and lookalike. Applying this direction to image and text embeddings improves discrimination in cross-modal retrieval on Conceptual 12M, and also enhances captions produced by a CLIP prefix captioner.
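A minimal sketch of the two ideas in the abstract, assuming a standard CLIP checkpoint loaded via Hugging Face transformers. The backbone choice, prompt wording, and mean-difference direction estimator are illustrative assumptions, not the paper's exact recipe:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative backbone; the paper's exact CLIP variant is not stated here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def paired_prompt_score(image: Image.Image, category: str) -> float:
    """Prompt-based baseline: score an image against paired
    'real' / 'lookalike' prompts and return P(real)."""
    prompts = [f"a photo of a real {category}",
               f"a photo of a lookalike of a {category}"]
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, 2)
    return logits.softmax(dim=-1)[0, 0].item()

def estimate_direction(real_embs: torch.Tensor,
                       lookalike_embs: torch.Tensor) -> torch.Tensor:
    """One plausible estimator of a real->lookalike direction:
    the difference of mean embeddings over the two exemplar sets."""
    d = lookalike_embs.mean(dim=0) - real_embs.mean(dim=0)
    return d / d.norm()

def shift_embedding(emb: torch.Tensor, direction: torch.Tensor,
                    alpha: float) -> torch.Tensor:
    """Move an image or text embedding along the direction
    (alpha > 0 toward 'lookalike', alpha < 0 toward 'real'),
    then renormalize for cosine-similarity retrieval."""
    shifted = emb + alpha * direction
    return shifted / shifted.norm(dim=-1, keepdim=True)
```

In this sketch, the same `shift_embedding` step could be applied to either side of a cross-modal retrieval query or to the prefix fed to a captioner; the scale `alpha` is a free parameter one would tune on held-out data.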
Similar Papers
Assessing the alignment between infants' visual and linguistic experience using multimodal language models
CV and Pattern Recognition
Tests how well AI matches what babies see and hear.
Relational Visual Similarity
CV and Pattern Recognition
Teaches computers to see how things are alike.
Vision language models have difficulty recognizing virtual objects
CV and Pattern Recognition
AI struggles to imagine unseen objects in pictures.