InspectVLM: Unified in Theory, Unreliable in Practice
By: Conor Wallace, Isaac Corley, Jonathan Lwowski
Potential Business Impact:
Makes computers see and understand factory flaws.
Unified vision-language models (VLMs) promise to streamline computer vision pipelines by reframing multiple visual tasks such as classification, detection, and keypoint localization within a single language-driven interface. This architecture is particularly appealing in industrial inspection, where managing disjoint task-specific models introduces complexity, inefficiency, and maintenance overhead. In this paper, we critically evaluate the viability of this unified paradigm using InspectVLM, a Florence-2-based VLM trained on InspectMM, our new large-scale multimodal, multitask inspection dataset. While InspectVLM performs competitively on image-level classification and structured keypoint tasks, we find that it fails to match traditional ResNet-based models on core inspection metrics. Notably, the model exhibits brittle behavior under low prompt variability, produces degenerate outputs for fine-grained object detection, and frequently defaults to memorized language responses regardless of visual input. Our findings suggest that while language-driven unification offers conceptual elegance, current VLMs lack the visual grounding and robustness necessary for deployment in precision-critical industrial inspection.
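To make the language-driven unification concrete: in a Florence-2-style model, every task is issued as a text prompt (typically a task token), and the answer, including bounding boxes, comes back as generated text that is parsed afterwards. The sketch below follows the publicly documented Hugging Face usage for microsoft/Florence-2-large; the model ID, the generic "<OD>" detection token, and the sample image path are illustrative assumptions, since InspectVLM's own checkpoint, task prompts, and defect vocabulary are not specified in the abstract.

```python
# Minimal sketch of prompting a Florence-2-style unified VLM for object
# detection via a task token, following the public reference usage for
# microsoft/Florence-2-large. InspectVLM fine-tunes a Florence-2 backbone,
# but its actual prompts and classes are assumptions here, not reproduced.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# One text prompt per task: detection, classification, keypoints, etc. are
# all expressed as language, which is the unification the paper evaluates.
task_prompt = "<OD>"  # generic object-detection task token
image = Image.open("inspection_sample.jpg")  # hypothetical inspection image

inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(device)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Decode the generated token sequence back into boxes and labels.
parsed = processor.post_process_generation(
    generated_text, task=task_prompt, image_size=(image.width, image.height)
)
print(parsed)  # e.g. {'<OD>': {'bboxes': [...], 'labels': [...]}}
```

This single text-in, text-out interface is also where the paper's failure modes surface: if generation collapses to a memorized response, the parsed boxes degrade regardless of the visual input.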