🤖 AI Summary
This work addresses three key challenges in industrial defect detection: the scarcity of large-scale multi-class datasets, the subjectivity of manual prompts, and coarse-grained vision-language interaction. It introduces MMIOC-1M, the first unified large-scale multimodal benchmark supporting both open-vocabulary and closed-set detection. The authors also propose RTVPNet, a model that combines expert-assisted domain projection, energy-based sparse sampling that generates visual prompts automatically, and a bidirectional text-visual interaction mechanism to achieve fine-grained cross-modal alignment and domain adaptation. Evaluated on MMIOC-1M, LVIS, and COCO, the method achieves state-of-the-art performance while maintaining computational efficiency, providing both a new benchmark and a strong baseline for industrial inspection tasks.
📝 Abstract
Large-scale Visual-Language Models (LVLMs) have achieved remarkable success in natural visual tasks, yet their application to industrial defect detection remains challenging due to two fundamental limitations: (i) the scarcity of large-scale industrial datasets that cover diverse defect categories across multiple domains, and (ii) the reliance on manual prompts (points, boxes, masks), which introduce subjective noise and lack the text-visual interaction needed for fine-grained understanding. To address these challenges, we introduce a Large-Scale Multi-Modal Industrial Open-Closed benchmark (MMIOC-1M) containing over one million samples across $14$ super-categories, $29$ industrial scenes, and $351$ defect subcategories. To our knowledge, MMIOC-1M is the first and largest unified benchmark supporting both open-vocabulary and closed-set industrial detection, providing valuable pre-training data for LVLMs in industrial scenarios. Furthermore, we propose a Refined Text-Visual Prompt Network (RTVPNet) that incorporates three key innovations: (1) an expert-assisted domain projection mechanism that enables rapid adaptation of general vision models to industrial domains, (2) an energy-based sparse sampling strategy that automatically generates refined visual prompts without manual intervention, and (3) a bidirectional text-visual interaction module that enhances cross-modal semantic alignment and understanding. Extensive experiments demonstrate that RTVPNet achieves state-of-the-art performance on the MMIOC-1M, LVIS, and COCO benchmarks while maintaining computational efficiency. The dataset and code are available at https://github.com/hellozzk/MMIO.
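The abstract does not define the energy function or the sampling rule behind innovation (2), so the following is only an illustrative sketch and not RTVPNet's implementation. It assumes the common free-energy score $E = -T \log \sum_c \exp(z_c / T)$ over per-location class logits and keeps the $k$ lowest-energy locations as candidate prompt points. The function names, the temperature parameter, and the top-$k$ rule are all hypothetical.

```python
import math


def energy(logits, temperature=1.0):
    """Free-energy score -T * logsumexp(logits / T); lower means more confident.

    Assumed scoring function; the abstract does not specify one.
    """
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


def sparse_prompt_indices(logits_per_location, k, temperature=1.0):
    """Return indices of the k lowest-energy locations (hypothetical prompt points)."""
    energies = [energy(loc, temperature) for loc in logits_per_location]
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    return order[:k]


# Four toy locations with two class logits each (made-up values).
toy_logits = [[0.0, 0.0], [5.0, 0.0], [2.0, 1.0], [-1.0, -1.0]]
selected = sparse_prompt_indices(toy_logits, k=2)
```

In this toy input, the locations with the strongest logits receive the lowest energy and are the ones selected, so only a sparse subset of positions would be passed on as prompts.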