A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision

📅 2026-06-01

🤖 AI Summary

This work addresses a gap in how text-guided anomaly detection methods are evaluated: existing protocols do not test whether language inputs actually influence model decisions, so the controllability of these methods is likely overestimated. The authors propose TGAD (Text-Guided Anomaly Detection), a structured benchmark comprising three progressively more demanding scenarios and an evaluation protocol that distinguishes superficial from substantive use of language in the decision. They also introduce APD (Assembled Panel Dataset), a new dataset of realistic industrial images with part-level annotations. Experiments on one representative model from each of three paradigms (generative large vision-language models, training-free discriminative approaches, and embedding-adaptive discriminative methods) show that current models use the text only superficially: they absorb changes in prompt content unless the object noun is removed, fail to follow part-level instructions, and in the most complex setting can perform below chance.

📝 Abstract

Industrial anomaly detection has historically been a unimodal task. Recent multimodal vision-language models have produced systems that admit textual input alongside the image and are presented as enabling text-guided zero- and few-shot inspection. Yet these methods are evaluated with protocols inherited from unimodal benchmarks that hold the textual condition constant and therefore cannot measure whether language conditions the decision; whether reported gains reflect text guidance or strong pretrained visual features remains open. We introduce Text-Guided Anomaly Detection (TGAD), a structured benchmark that progressively increases the functional role of language across three scenarios: a controlled prompt-sensitivity setting on MVTec AD; a component-tagged extension of MVTec AD that requires the model to restrict its assessment to an instructed part; and the new Assembled Panel Dataset (APD), a realistic industrial setting that requires both defect-type and component-location knowledge. We evaluate one representative model per paradigm: generative large vision-language, training-free discriminative, and embedding-adaptive discriminative. In all three, the textual interface conditions the decision only superficially: prompt content is absorbed unless the object noun is removed (the generative model's I-AUROC drops from 97.4 to 82.6); component-level instructions do not constrain the decision once defects outside the instructed part are admitted as normal (from 90.3 to 66.3); and when both combine on APD, image-level discrimination collapses below the MVTec level, in one case below chance (71.2, 50.5, 31.5). These results suggest that standard benchmarks overstate the text-guided capabilities of current multimodal anomaly detection systems, and that a protocol of this kind is a prerequisite for models that can be reliably controlled through language for industrial deployment.
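To make the component-level scenario and the "below chance" result concrete, here is a minimal sketch of how an image-level AUROC (I-AUROC) behaves when the ground-truth label depends on the instruction. It is not the benchmark's code: the part names, the anomaly scores, and the helper names `auroc` and `conditioned_label` are all hypothetical, and the labeling rule is only an assumption based on the abstract's description that defects outside the instructed part count as normal. AUROC is computed here on a 0 to 1 scale, whereas the abstract reports it in percent (chance is 50).

```python
from typing import Optional, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) AUROC: the fraction of (anomalous, normal)
    pairs in which the anomalous image gets the higher score; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one image of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def conditioned_label(defect_part: Optional[str], instructed_part: str) -> int:
    """Assumed labeling rule: an image is anomalous only if its defect
    lies on the part named in the text instruction."""
    return int(defect_part == instructed_part)


# Hypothetical toy data: which part (if any) is defective in each image,
# and the score of a model that ignores the instruction entirely.
defect_parts = ["screw", "screw", "cable", "cable", None, None]
scores = [0.9, 0.8, 0.85, 0.7, 0.2, 0.1]

# Standard protocol: any defect anywhere counts as anomalous.
unconditioned = [int(p is not None) for p in defect_parts]
# Text-conditioned protocol: only defects on the instructed part count.
conditioned = [conditioned_label(p, "screw") for p in defect_parts]

auroc_uncond = auroc(unconditioned, scores)
auroc_cond = auroc(conditioned, scores)
# Negating the scores reverses the ranking, which yields 1 - AUROC.
auroc_inverted = auroc(conditioned, [-s for s in scores])

print(auroc_uncond, auroc_cond, auroc_inverted)
```

In this toy setup the same scores separate the classes perfectly under the standard labels but not under the instruction-conditioned labels, because the model scores the "cable" defects as anomalous even though the instruction asks only about the "screw". The last line illustrates why a below-chance value is informative: an AUROC under 0.5 means the ranking is systematically reversed with respect to the labels, not merely uninformative.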

Problem

Research questions and friction points this paper is trying to address.

text-guided anomaly detection

vision-language models

industrial inspection

multimodal benchmark

language conditioning

Innovation

Methods, ideas, or system contributions that make the work stand out.

text-guided anomaly detection

structured benchmark

vision-language models