Unification of Closed-Open Industrial Detection Scenarios: New Large-Scale Benchmarks, Challenges and Baselines

📅 2026-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in industrial defect detection—namely, the scarcity of large-scale multi-class datasets, the subjectivity of manual prompts, and coarse-grained vision-language interactions—by introducing MMIOC-1M, the first unified large-scale multimodal benchmark supporting both open-vocabulary and closed-set detection. The authors propose RTVPNet, a novel model that leverages expert-guided domain projection, energy-based sparse sampling to generate automatic visual prompts, and a bidirectional vision-language interaction mechanism to achieve fine-grained cross-modal alignment and domain adaptation. Evaluated on MMIOC-1M, LVIS, and COCO, the method achieves state-of-the-art performance while maintaining computational efficiency, establishing a new benchmark and strong baseline for industrial inspection tasks.
📝 Abstract
Large-scale Visual-Language Models (LVLMs) have achieved remarkable success in natural visual tasks, yet their application to industrial defect detection remains challenging due to two fundamental limitations: (i) the scarcity of large-scale industrial datasets that cover diverse defect categories across multiple domains, and (ii) the reliance on manual prompts (points, boxes, masks) that introduce subjective noise and lack text-visual interaction for fine-grained understanding. To address these challenges, we introduce a Large-Scale Multi-Modal Industrial Open-Closed benchmark (MMIOC-1M) containing over one million samples across $14$ super-categories, $29$ industrial scenes, and $351$ defect subcategories. To our knowledge, MMIOC-1M is the first and largest unified benchmark supporting both open-vocabulary and closed-set industrial detection, providing valuable pre-training data for LVLMs in industrial scenarios. Furthermore, we propose a Refined Text-Visual Prompt Network (RTVPNet) that incorporates three key innovations: (1) an expert-assisted domain projection mechanism that enables rapid adaptation of general vision models to industrial domains, (2) an energy-based sparse sampling strategy that automatically generates refined visual prompts without manual intervention, and (3) a bidirectional text-visual interaction module that enhances cross-modal semantic alignment and understanding. Extensive experiments demonstrate that RTVPNet achieves state-of-the-art performance on the MMIOC-1M, LVIS, and COCO benchmarks while maintaining computational efficiency. The dataset and code are available at https://github.com/hellozzk/MMIO.
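The abstract does not say how the energy-based sparse sampling is computed, so the following is only an illustrative sketch of the general idea, not RTVPNet's implementation. It assumes a commonly used free-energy score, E = -T * logsumexp(logits / T), evaluated at each location of a small grid, and keeps the k lowest-energy locations as point prompts. The function names, the grid, and the choice of k are all hypothetical.

```python
import math

def free_energy(logits, temperature=1.0):
    """Assumed score: E = -T * logsumexp(logits / T); lower means more confident."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(v - m) for v in scaled)))

def sparse_prompt_points(logit_map, k=2, temperature=1.0):
    """Return the k (row, col) locations with the lowest energy as point prompts."""
    scored = [
        (free_energy(cell, temperature), (r, c))
        for r, row in enumerate(logit_map)
        for c, cell in enumerate(row)
    ]
    scored.sort(key=lambda item: item[0])  # lowest energy first
    return [loc for _, loc in scored[:k]]

# Hypothetical 2x2 grid; each cell holds per-class logits for that location.
logit_map = [
    [[0.1, 0.2], [5.0, 0.0]],
    [[0.0, 0.1], [0.3, 4.0]],
]
points = sparse_prompt_points(logit_map, k=2)
print(points)
```

Under these assumptions, the two locations with one strongly dominant logit are selected, which is the sense in which such sampling could replace hand-placed point prompts. The actual model may define energy and sparsity differently.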
Problem

Research questions and friction points this paper is trying to address.

industrial defect detection
large-scale benchmark
visual-language models
open-vocabulary detection
text-visual interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-Scale Visual-Language Models
Industrial Defect Detection
Open-Closed Unified Benchmark
Text-Visual Prompting
Cross-Modal Interaction
🔎 Similar Papers
Zekai Zhang
School of Control Science and Engineering, Shandong University, Jinan 250061, China
Jinglin Zhang
School of Control Science and Engineering, Shandong University, Jinan 250061, China
Qinghui Chen
School of Control Science and Engineering, Shandong University, Jinan 250061, China
Gang Li
Department of Plant and Soil Sciences, University of Delaware
Soil Biogeochemistry, Soil-Plant-Microbe Interaction, Emerging contaminants
Da Chen
CEREMADE, University Paris Dauphine, PSL Research University, CNRS, UMR 7534, 75775 Paris, France
Shuainan Jing
Shandong Computer Science Center, Qilu University of Technology, Jinan, China
He Wang
Shandong Computer Science Center, Qilu University of Technology, Jinan, China
Dagang Li
Macau University of Science and Technology
Network, Graph, Time series, RL, LLM
Cong Liu
NOVA Information Management School, Nova University of Lisbon, 1070-312 Lisbon, Portugal
Cong Bai
College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China
Shengyong Chen
School of Computer Sciences and Engineering, Tianjin University of Technology, Tianjin 300384, China