Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation
By: Mehrdad Noori, David Osowiechi, Gustavo Adolfo Vargas Hakim, and more
Potential Business Impact:
Helps computers label every part of new pictures, even kinds they were never trained on.
Test-time adaptation (TTA) has recently attracted wide interest in the context of vision-language models (VLMs) for image classification. However, to the best of our knowledge, the problem has been completely overlooked in dense prediction tasks such as Open-Vocabulary Semantic Segmentation (OVSS). In response, we propose a novel TTA method tailored to adapting VLMs for segmentation at test time. Unlike TTA methods for image classification, our Multi-Level and Multi-Prompt (MLMP) entropy minimization integrates features from intermediate vision-encoder layers and is performed with different text-prompt templates at both the global CLS-token and local pixel-wise levels. Our approach can be used as a plug-and-play module with any segmentation network, requires no additional training data or labels, and remains effective even with a single test sample. Furthermore, we introduce a comprehensive OVSS TTA benchmark suite that integrates a rigorous evaluation protocol, seven segmentation datasets, and 15 common corruptions, for a total of 82 distinct test scenarios, establishing a standardized and comprehensive testbed for future TTA research in open-vocabulary segmentation. Our experiments on this suite demonstrate that our segmentation-tailored method consistently delivers significant gains over the direct adoption of TTA classification baselines.
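To make the MLMP idea concrete, here is a minimal PyTorch sketch of a multi-level, multi-prompt entropy objective as described in the abstract. It is not the authors' implementation: the tensor layouts, the `feats` and `text_embs` names, and the adapted scale parameter are all illustrative assumptions for a CLIP-like model that exposes intermediate patch-token features (with the CLS token at index 0) and precomputed text embeddings for several prompt templates.

```python
# Hypothetical sketch of Multi-Level, Multi-Prompt (MLMP) entropy minimization.
# Assumptions (not from the paper's code): `feats` is a list of intermediate
# vision-encoder features, each [B, N, D] with the CLS token at index 0, and
# `text_embs` is [T, C, D] for T prompt templates and C class names.

import torch
import torch.nn.functional as F


def entropy(probs: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy over the class dimension (last dim)."""
    return -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()


def mlmp_loss(feats: list[torch.Tensor], text_embs: torch.Tensor,
              tau: float = 0.01) -> torch.Tensor:
    """Average entropy across vision layers, prompt templates, and both the
    global (CLS-token) and local (pixel/patch-wise) prediction levels."""
    losses = []
    for f in feats:                                # one intermediate layer
        f = F.normalize(f, dim=-1)                 # [B, N, D]
        for t in F.normalize(text_embs, dim=-1):   # t: [C, D], one template
            logits = f @ t.T / tau                 # [B, N, C] similarity logits
            probs = logits.softmax(dim=-1)
            losses.append(entropy(probs[:, 0]))    # global CLS-level entropy
            losses.append(entropy(probs[:, 1:]))   # local pixel-wise entropy
    return torch.stack(losses).mean()


# Toy usage: 2 layers, 3 templates, 4 classes, a single test sample. Only a
# small set of parameters (a hypothetical per-channel scale) is adapted.
B, N, D, T, C = 1, 1 + 196, 512, 3, 4
feats = [torch.randn(B, N, D) for _ in range(2)]
text_embs = torch.randn(T, C, D)
scale = torch.nn.Parameter(torch.ones(D))
opt = torch.optim.SGD([scale], lr=1e-3)

loss = mlmp_loss([f * scale for f in feats], text_embs)
opt.zero_grad()
loss.backward()
opt.step()
print(f"MLMP entropy loss: {loss.item():.4f}")
```

Averaging the entropy over layers, templates, and both prediction levels is what distinguishes this objective from classification-style TTA, which typically minimizes entropy only on a single global prediction from the final layer.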
Similar Papers
Realistic Test-Time Adaptation of Vision-Language Models
CV and Pattern Recognition
Helps AI understand new things without extra training.
Segmentation Assisted Incremental Test Time Adaptation in an Open World
CV and Pattern Recognition
Helps AI learn new things without stopping.
CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation
CV and Pattern Recognition
Helps AI understand new pictures better.