Score: 0

ACD-CLIP: Decoupling Representation and Dynamic Fusion for Zero-Shot Anomaly Detection

Published: August 11, 2025 | arXiv ID: 2508.07819v4

By: Ke Ma , Jun Long , Hongxiao Fei and more

Potential Business Impact:

Finds weird things in pictures better.

Pre-trained Vision-Language Models (VLMs) struggle with Zero-Shot Anomaly Detection (ZSAD) due to a critical adaptation gap: they lack the local inductive biases required for dense prediction and employ inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method proposes a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks.

Page Count
5 pages

Category
Computer Science:
CV and Pattern Recognition