ACD-CLIP: Decoupling Representation and Dynamic Fusion for Zero-Shot Anomaly Detection
By: Ke Ma , Jun Long , Hongxiao Fei and more
Potential Business Impact:
Finds weird things in pictures better.
Pre-trained Vision-Language Models (VLMs) struggle with Zero-Shot Anomaly Detection (ZSAD) due to a critical adaptation gap: they lack the local inductive biases required for dense prediction and employ inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method proposes a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks.
Similar Papers
Architectural Co-Design for Zero-Shot Anomaly Detection: Decoupling Representation and Dynamically Fusing Features in CLIP
CV and Pattern Recognition
Finds hidden problems in pictures using words.
AD-DINOv3: Enhancing DINOv3 for Zero-Shot Anomaly Detection with Anomaly-Aware Calibration
CV and Pattern Recognition
Finds weird things without being told what to look for.
AD-DINOv3: Enhancing DINOv3 for Zero-Shot Anomaly Detection with Anomaly-Aware Calibration
CV and Pattern Recognition
Finds weird things even without knowing them.