Radio Astronomy in the Era of Vision-Language Models: Prompt Sensitivity and Adaptation
By: Mariia Drozdova, Erica Lastufka, Vitaliy Kinakh, and more
Potential Business Impact:
AI learns to sort pictures of radio galaxies using written prompts.
Vision-Language Models (VLMs), such as recent Qwen and Gemini models, are positioned as general-purpose AI systems capable of reasoning across domains. Yet their capabilities in scientific imaging, especially on unfamiliar and likely previously unseen data distributions, remain poorly understood. In this work, we assess whether generic VLMs, presumed to lack exposure to astronomical corpora, can perform morphology-based classification of radio galaxies using the MiraBest FR-I/FR-II dataset. We explore prompting strategies using natural language and schematic diagrams, and, to the best of our knowledge, we are the first to introduce visual in-context examples in prompts for astronomy. Additionally, we evaluate lightweight supervised adaptation via LoRA fine-tuning. Our findings reveal three trends: (i) even prompt-based approaches can achieve good performance, suggesting that VLMs encode useful priors for unfamiliar scientific domains; (ii) however, outputs are highly unstable, varying sharply with superficial prompt changes such as layout, ordering, or decoding temperature, even when semantic content is held constant; and (iii) with just 15M trainable parameters and no astronomy-specific pretraining, fine-tuned Qwen-VL achieves near state-of-the-art performance (3% error rate), rivaling domain-specific models. These results suggest that the apparent "reasoning" of VLMs often reflects prompt sensitivity rather than genuine inference, raising caution for their use in scientific domains. At the same time, with minimal adaptation, generic VLMs can rival specialized models, offering a promising but fragile tool for scientific discovery.
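To make the visual in-context prompting idea concrete, here is a minimal sketch of a few-shot multimodal prompt sent to a Qwen2-VL checkpoint through Hugging Face transformers. The checkpoint name, image paths, and prompt wording are illustrative assumptions; the paper's actual prompts and model version are not reproduced here.

```python
# Sketch: few-shot (visual in-context) prompt for FR-I vs. FR-II classification
# with a Qwen2-VL checkpoint. Checkpoint, paths, and prompt text are assumptions.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"  # assumed checkpoint
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Two labeled in-context example images, followed by the query image.
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Classify radio galaxies as FR-I (edge-darkened) "
                                  "or FR-II (edge-brightened)."},
        {"type": "image", "image": "file:///data/examples/fr1_example.png"},  # hypothetical path
        {"type": "text", "text": "Example 1: FR-I."},
        {"type": "image", "image": "file:///data/examples/fr2_example.png"},  # hypothetical path
        {"type": "text", "text": "Example 2: FR-II."},
        {"type": "image", "image": "file:///data/query/source_042.png"},      # hypothetical path
        {"type": "text", "text": "Answer with exactly one label: FR-I or FR-II."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, padding=True,
                   return_tensors="pt").to(model.device)

# Greedy decoding removes one source of instability the paper flags: decoding temperature.
out_ids = model.generate(**inputs, max_new_tokens=8, do_sample=False)
answer = processor.batch_decode(
    out_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

Because the paper reports strong sensitivity to layout and example ordering, such a script would typically be run over several prompt variants (example order, diagram vs. natural-language description) rather than trusted from a single formulation.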
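The lightweight adaptation route can likewise be sketched with the PEFT library: LoRA adapters are attached to the attention projections of the same backbone so that only a small adapter budget is trained. The rank, scaling, and target modules below are assumptions chosen to be in the same ballpark as the ~15M trainable parameters mentioned in the abstract; the authors' exact configuration may differ.

```python
# Sketch: attaching LoRA adapters to the Qwen2-VL backbone with PEFT.
# Rank, alpha, dropout, and target modules are assumptions, not the paper's settings.
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,                      # assumed adapter rank
    lora_alpha=32,             # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_cfg)  # `model` from the previous sketch
peft_model.print_trainable_parameters()       # only the adapter weights are trainable

# Supervised fine-tuning would then run a standard training loop (e.g. transformers.Trainer)
# over MiraBest image-label pairs rendered as chat-style prompts with FR-I/FR-II targets.
```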
Similar Papers
Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning
CV and Pattern Recognition
Helps self-driving cars understand roads better.
A Recipe for Improving Remote Sensing VLM Zero Shot Generalization
CV and Pattern Recognition
Teaches computers to understand satellite pictures.
Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation
CV and Pattern Recognition
Helps computers understand pictures better by combining words and images.