Architectural Co-Design for Zero-Shot Anomaly Detection: Decoupling Representation and Dynamically Fusing Features in CLIP

Published: August 11, 2025 | arXiv ID: 2508.07819v1

By: Ke Ma, Jun Long, Hongxiao Fei, and more

Potential Business Impact:

Spots defects in images using only text descriptions, without needing labeled examples of the defects themselves.

Pre-trained Vision-Language Models (VLMs) face a significant adaptation gap when applied to Zero-Shot Anomaly Detection (ZSAD), stemming from their lack of local inductive biases for dense prediction and their reliance on inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method integrates a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks.
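To make the Conv-LoRA idea concrete, the sketch below shows one plausible way such an adapter could wrap a frozen CLIP projection layer: a low-rank down/up projection with a small 2-D convolution in the bottleneck that mixes neighboring patch tokens. This is a minimal illustration assuming a PyTorch-style ViT backbone; the class name, rank, kernel size, and placement are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class ConvLoRALinear(nn.Module):
    """Frozen linear layer plus a low-rank branch whose bottleneck applies a
    small convolution over the 2-D patch grid, adding local inductive bias."""

    def __init__(self, base: nn.Linear, rank: int = 4, kernel_size: int = 3):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained CLIP weights frozen

        self.down = nn.Linear(base.in_features, rank, bias=False)       # LoRA "A"
        self.conv = nn.Conv2d(rank, rank, kernel_size,
                              padding=kernel_size // 2)                 # local mixing
        self.up = nn.Linear(rank, base.out_features, bias=False)        # LoRA "B"
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op

    def forward(self, tokens: torch.Tensor, grid_hw: tuple) -> torch.Tensor:
        # tokens: (batch, num_patches, in_features) patch embeddings from the ViT
        h, w = grid_hw
        b, n, _ = tokens.shape
        z = self.down(tokens)                          # (B, N, r)
        z = z.transpose(1, 2).reshape(b, -1, h, w)     # tokens -> 2-D feature grid
        z = self.conv(z)                               # convolve neighboring patches
        z = z.reshape(b, -1, n).transpose(1, 2)        # grid -> token sequence
        return self.base(tokens) + self.up(z)          # frozen path + adapted path


# Example: adapt a 768-d projection for a 14x14 patch grid (ViT-B/16 at 224px).
layer = ConvLoRALinear(nn.Linear(768, 768), rank=4)
x = torch.randn(2, 196, 768)
y = layer(x, grid_hw=(14, 14))   # y has shape (2, 196, 768)
```

Only the low-rank branch is trained, so the adapter stays parameter-efficient while the convolution gives the otherwise global attention features a notion of spatial locality, which is what dense anomaly localization needs.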

Page Count
5 pages

Category
Computer Science: Computer Vision and Pattern Recognition