GTMA: Dynamic Representation Optimization for OOD Vision-Language Models
By: Jensen Zhang, Ningyuan Liu, Keze Wang
Vision-language models (VLMs) struggle in open-world applications, where out-of-distribution (OOD) concepts often trigger cross-modal alignment collapse and severely degrade zero-shot performance. We identify the root cause as modal asymmetry: while the visual encoder can extract discriminative features from unseen images, the text encoder is constrained by a fixed discrete vocabulary and cannot synthesize new semantic anchors. Existing approaches such as CoOp and LoRA provide only partial remedies, as they remain confined to the pre-trained semantic space. To overcome this bottleneck, we propose dynamic representation optimization, realized through the Guided Target-Matching Adaptation (GTMA) framework. At inference time, GTMA constructs a continuous pseudo-word embedding that best aligns with an OOD image's visual anchor, effectively bypassing vocabulary limitations. The optimization is driven by an adaptive gradient-based representation policy optimization algorithm, which incorporates semantic regularization to preserve plausibility and compatibility with the model's prior knowledge. Experiments on ImageNet-R and the VISTA-Beyond benchmark demonstrate that GTMA improves zero-shot and few-shot OOD accuracy by 15-20 percent over the base VLM while maintaining performance on in-distribution concepts. Ablation studies further confirm the necessity of pseudo-word optimization.
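To make the pseudo-word construction concrete, the sketch below illustrates the kind of inference-time optimization the abstract describes, assuming a CLIP-style VLM with a frozen image encoder and a text encoder that accepts token embeddings directly. Everything here is an illustrative assumption rather than the authors' implementation: the function name `optimize_pseudo_word`, the use of plain Adam in place of the paper's adaptive representation policy optimization, the cosine alignment loss, and the nearest-vocabulary-embedding penalty weighted by `lambda_reg` as one possible form of the semantic regularizer.

```python
import torch
import torch.nn.functional as F


def optimize_pseudo_word(image_feat, text_encoder, prompt_embeds, slot_idx,
                         vocab_embeds, steps=100, lr=1e-2, lambda_reg=0.1):
    """Inference-time search for a continuous pseudo-word embedding (sketch).

    image_feat    : (d,) L2-normalized visual anchor from the frozen image encoder.
    text_encoder  : callable mapping a (L, d_tok) token-embedding sequence to a (d,) text feature.
    prompt_embeds : (L, d_tok) token embeddings of a prompt template with a placeholder slot.
    slot_idx      : index of the placeholder token to be replaced by the pseudo-word.
    vocab_embeds  : (V, d_tok) frozen vocabulary embedding table, used for regularization.
    """
    # Start from the placeholder token so the search begins inside the
    # model's prior embedding space rather than from random noise.
    pseudo = prompt_embeds[slot_idx].clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([pseudo], lr=lr)

    for _ in range(steps):
        # Splice the learnable pseudo-word into the otherwise frozen prompt.
        tokens = prompt_embeds.clone()
        tokens[slot_idx] = pseudo
        text_feat = F.normalize(text_encoder(tokens), dim=-1)

        # Alignment term: pull the text feature toward the image's visual anchor.
        align_loss = 1.0 - torch.dot(text_feat, image_feat)

        # Semantic regularizer (assumed form): keep the pseudo-word near its
        # closest real vocabulary embedding so it stays plausible under the prior.
        reg_loss = torch.cdist(pseudo.unsqueeze(0), vocab_embeds).min()

        loss = align_loss + lambda_reg * reg_loss
        opt.zero_grad()
        loss.backward()
        opt.step()

    return pseudo.detach()
```

In this reading, only the single pseudo-word vector is updated while both encoders stay frozen, which is what lets the method synthesize a new semantic anchor without touching the discrete vocabulary; the min-distance term is one simple way to realize the "plausibility and compatibility" constraint mentioned in the abstract.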