🤖 AI Summary
This work addresses the challenge of balancing CLIP's strong generalization capabilities with fine-grained sensitivity to local image regions in zero-shot anomaly detection. To this end, the authors propose MoECLIP, a method built upon a Mixture-of-Experts (MoE) architecture that dynamically assigns dedicated low-rank adaptation (LoRA) experts to individual image patches. MoECLIP incorporates a Frozen Orthogonal Feature Separation mechanism to enforce complementary feature learning among experts and employs a Simplex Equiangular Tight Frame (ETF) loss to encourage expert outputs to form maximally equiangular representations, thereby mitigating functional redundancy. Extensive experiments across 14 industrial and medical benchmark datasets demonstrate that MoECLIP significantly outperforms current state-of-the-art methods, confirming its effectiveness and robust generalization ability.
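The per-patch routing idea described above can be sketched numerically. Below is a minimal NumPy illustration under assumed shapes; all names, the dimensions (`d=64`, rank `r=4`, 4 experts), and the top-1 routing rule are hypothetical choices for illustration, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts, n_patches = 64, 4, 4, 196  # hypothetical sizes

W = rng.normal(size=(d, d)) / np.sqrt(d)        # stands in for a frozen CLIP weight
A = rng.normal(size=(n_experts, r, d)) * 0.01   # LoRA down-projections, one per expert
B = np.zeros((n_experts, d, r))                 # LoRA up-projections (zero-initialized)
router = rng.normal(size=(d, n_experts)) * 0.01 # patch-level router weights

patches = rng.normal(size=(n_patches, d))       # patch token features

# Top-1 routing: each patch is sent to the expert with the highest router logit.
expert_idx = (patches @ router).argmax(axis=1)

out = np.empty_like(patches)
for e in range(n_experts):
    mask = expert_idx == e
    # Frozen weight plus this expert's low-rank update B_e @ A_e.
    out[mask] = patches[mask] @ (W + B[e] @ A[e]).T

print(out.shape)  # (196, 64)
```

Because each expert's update is rank-`r`, specializing an expert per patch group adds only a small number of trainable parameters on top of the frozen backbone.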
📝 Abstract
The CLIP model's outstanding generalization has driven recent success in Zero-Shot Anomaly Detection (ZSAD) for detecting anomalies in unseen categories. The core challenge in ZSAD is to specialize the model for anomaly detection tasks while preserving CLIP's powerful generalization capability. Existing approaches attempting to solve this challenge share the fundamental limitation of a patch-agnostic design that processes all patches monolithically without regard for their unique characteristics. To address this limitation, we propose **MoECLIP**, a Mixture-of-Experts (MoE) architecture for the ZSAD task, which achieves patch-level adaptation by dynamically routing each image patch to a specialized Low-Rank Adaptation (LoRA) expert based on its unique characteristics. Furthermore, to prevent functional redundancy among the LoRA experts, we introduce (1) Frozen Orthogonal Feature Separation (FOFS), which orthogonally separates the input feature space to force experts to focus on distinct information, and (2) a simplex equiangular tight frame (ETF) loss to regulate the expert outputs to form maximally equiangular representations. Comprehensive experimental results across 14 benchmark datasets spanning industrial and medical domains demonstrate that MoECLIP outperforms existing state-of-the-art methods. The code is available at https://github.com/CoCoRessa/MoECLIP.
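The "maximally equiangular" target behind the ETF loss is a standard construction: K unit vectors whose pairwise cosine similarity is exactly -1/(K-1), the most mutually separated arrangement possible for K directions. A minimal sketch of that construction and a Gram-matrix-matching loss follows; the `etf_loss` helper and all dimensions are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

K, d = 4, 64  # hypothetical: number of experts, feature dimension

# Simplex ETF: columns of M are K unit vectors with pairwise cosine -1/(K-1).
U, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(d, K)))  # orthonormal cols
M = np.sqrt(K / (K - 1)) * U @ (np.eye(K) - np.ones((K, K)) / K)   # d x K ETF frame

cos = M.T @ M  # ideal Gram matrix: 1 on the diagonal, -1/(K-1) elsewhere

def etf_loss(Z):
    """MSE between the experts' cosine matrix and the ideal ETF Gram matrix.

    Z: d x K matrix whose columns are (e.g., mean) expert output directions.
    """
    Zn = Z / np.linalg.norm(Z, axis=0, keepdims=True)  # unit-normalize columns
    return np.mean((Zn.T @ Zn - cos) ** 2)
```

Driving this loss to zero forces the expert outputs to spread out like simplex vertices, which is one concrete way to penalize functionally redundant (near-collinear) experts.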