Zero-Shot Learning in Industrial Scenarios: New Large-Scale Benchmark, Challenges and Baseline

📅 2026-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of zero-shot defect detection in industrial settings, where performance is hindered by significant domain shifts from natural images, scarce annotated data, and the reliance of existing vision-language models on handcrafted prompts. To tackle these challenges, the authors introduce MMIO, the first large-scale multimodal open dataset tailored for industrial zero-shot learning, and propose the Refined Text-Visual Prompt (RTVP) method. RTVP integrates expert-guided domain adaptation of large models with an automated visual prompt generation mechanism. Evaluated on MMIO, RTVP achieves 42.2% AP in the zero-shot setting and 24.7% AP under closed-set conditions, substantially outperforming current approaches and establishing the first benchmark for industrial zero-shot defect detection.
📝 Abstract
Large Visual Language Models (LVLMs) have achieved remarkable success in vision tasks. However, the significant differences between industrial and natural scenes make applying LVLMs to industrial settings challenging. Existing LVLMs rely on user-provided prompts to segment objects, which often leads to suboptimal performance because irrelevant pixels are included. In addition, data scarcity has left the application of LVLMs in industrial scenarios unexplored. To fill this gap, this paper proposes an open industrial dataset and a Refined Text-Visual Prompt (RTVP) for zero-shot industrial defect detection. First, this paper constructs the Multi-Modal Industrial Open Dataset (MMIO), which contains 80K+ samples spanning diverse industrial categories, organized into 6 super categories and 18 subcategories. MMIO is the first large-scale multi-scene pre-training dataset for industrial zero-shot learning and provides valuable training data for open models in future industrial scenarios. Based on MMIO, this paper proposes RTVP, designed specifically for industrial zero-shot tasks. RTVP has two significant advantages. First, it combines an expert-guided domain adaptation mechanism for large models with an industrial zero-shot method based on Mobile-SAM, which enhances the generalization ability of large models in industrial scenarios. Second, RTVP automatically generates visual prompts directly from images and accounts for the text-visual prompt interactions ignored by previous LVLMs, improving the understanding of visual and textual content. RTVP achieves state-of-the-art results on MMIO, with 42.2% AP in zero-shot scenes and 24.7% AP in closed scenes.
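The abstract reports results as AP (average precision) but does not state the exact protocol (IoU thresholds, box versus mask matching, averaging over classes). As a rough illustration only, the sketch below computes single-class AP with all-point interpolation, assuming detections have already been matched to ground truth. The function name `average_precision` and its parameters are hypothetical and are not taken from the paper.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """All-point interpolated AP for one class (illustrative sketch).

    Assumption: each detection has already been matched against ground
    truth, so is_true_positive[i] says whether detection i is a hit.
    The paper's own matching rule and thresholds are not given here.
    """
    if num_ground_truth == 0:
        return 0.0

    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    tp, fp = 0, 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Build the precision envelope: each precision becomes the maximum
    # precision achieved at that recall or any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Integrate the envelope over recall.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Four ranked detections, two ground-truth defects: hit, miss, hit, miss.
example_ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False], 2)
# example_ap is 5/6, roughly 0.833
```

Benchmark-style AP figures are typically averaged over classes, and often over several IoU thresholds as well, so a single number such as 42.2% would aggregate many per-class values of this kind.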
Problem

Research questions and friction points this paper is trying to address.

Zero-Shot Learning
Industrial Defect Detection
Large Visual Language Models
Prompt Engineering
Industrial Benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Shot Learning
Industrial Defect Detection
Large Visual Language Models
Text-Visual Prompting
Domain Adaptation
Authors

Zekai Zhang
School of Control Science and Engineering, Shandong University, Jinan, China
Qinghui Chen
School of Control Science and Engineering, Shandong University, Jinan, China
Maomao Xiong
School of Control Science and Engineering, Shandong University, Jinan, China
Shijiao Ding
School of Control Science and Engineering, Shandong University, Jinan, China
Zhanzhi Su
National Supercomputing Center in Jinan, Qilu University of Technology, Jinan, China
Xinjie Yao
College of Intelligence and Computing, Tianjin University, Tianjin, China
Yiming Sun
Southeast University
Multi-modal Learning, Computer Vision
Cong Bai
College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China
Jinglin Zhang
School of Control Science and Engineering, Shandong University, Jinan, China