Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
By: Earl Ranario, Mason J. Earles
Potential Business Impact:
Shows which AI models farmers can trust to spot plant diseases, pests, and weeds.
Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural classification datasets from the AgML collection, spanning 162 classes across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (for example, from 21% to 30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
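The abstract contrasts three evaluation setups: constrained multiple-choice prompting, open-ended prompting scored by exact match, and open-ended prompting scored by an LLM judge. Below is a minimal sketch of how such a harness might look; `query_vlm` and `judge` are hypothetical callables standing in for whatever model APIs the authors actually used, and nothing here comes from the paper's code or the AgML library.

```python
# Hypothetical evaluation-harness sketch; no real VLM/LLM SDK is assumed.
from typing import Callable, Sequence

def multiple_choice_prompt(class_names: Sequence[str]) -> str:
    """Constrained prompt: the model must answer with one label from a fixed list."""
    options = "\n".join(f"- {name}" for name in class_names)
    return ("You are an agricultural image classifier. "
            "Answer with exactly one label from this list:\n" + options)

def evaluate_multiple_choice(
    query_vlm: Callable[[bytes, str], str],  # (image_bytes, prompt) -> model answer
    samples: Sequence[tuple[bytes, str]],    # (image_bytes, ground-truth label)
    class_names: Sequence[str],
) -> float:
    """Accuracy under constrained (multiple-choice) prompting."""
    prompt = multiple_choice_prompt(class_names)
    hits = sum(query_vlm(img, prompt).strip().lower() == label.lower()
               for img, label in samples)
    return hits / len(samples)

def evaluate_open_ended(
    query_vlm: Callable[[bytes, str], str],
    judge: Callable[[str, str], bool],       # (prediction, truth) -> same concept?
    samples: Sequence[tuple[bytes, str]],
) -> tuple[float, float]:
    """Returns (raw exact-match accuracy, LLM-judged semantic accuracy)."""
    prompt = "What plant disease, pest, or species is shown in this image?"
    raw = judged = 0
    for img, label in samples:
        answer = query_vlm(img, prompt)
        raw += int(answer.strip().lower() == label.lower())
        judged += int(judge(answer, label))  # accepts e.g. "leaf rust" for "wheat rust"
    n = len(samples)
    return raw / n, judged / n
```

The gap between the two numbers returned by `evaluate_open_ended` is what the abstract reports: exact-match scoring penalizes synonymous or more specific answers, so semantic judging lifts open-ended accuracy (e.g., from 21% to 30% for top models) and can reorder model rankings.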