STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Computer-Aided Screening (CAS) systems for X-ray baggage inspection are constrained by closed-set classification paradigms and by training datasets misaligned with real-world threat distributions, limiting their ability to detect concealed and heterogeneous security threats. To address this, we introduce STCray, the first multimodal image-text paired dataset for X-ray security screening, comprising 46,642 samples across 21 threat categories. We further propose STING-BEE, a domain-adaptive vision-language model supporting open-ended tasks including threat localization, visual grounding, and visual question answering. Our method introduces a novel multimodal instruction-data construction paradigm for X-ray screening that integrates X-ray physics modeling, vision-language alignment, instruction tuning, and domain-adaptive pretraining, breaking the closed-set constraint and enabling open-vocabulary understanding and cross-domain generalization. STING-BEE achieves state-of-the-art performance across multiple X-ray multimodal benchmarks. All data, code, and models are publicly released.

📝 Abstract
Advancements in Computer-Aided Screening (CAS) systems are essential for improving the detection of security threats in X-ray baggage scans. However, current datasets are limited in representing real-world, sophisticated threats and concealment tactics, and existing approaches are constrained by a closed-set paradigm with predefined labels. To address these challenges, we introduce STCray, the first multimodal X-ray baggage security dataset, comprising 46,642 image-caption paired scans across 21 threat categories, generated using an X-ray scanner for airport security. STCray is meticulously developed with our specialized protocol that ensures domain-aware, coherent captions, which in turn yield multimodal instruction-following data for X-ray baggage security. This allows us to train a domain-aware visual AI assistant, named STING-BEE, that supports a range of vision-language tasks, including scene comprehension, referring threat localization, visual grounding, and visual question answering (VQA), establishing novel baselines for multimodal learning in X-ray baggage security. Further, STING-BEE shows state-of-the-art generalization in cross-domain settings. Code, data, and models are available at https://divs1159.github.io/STING-BEE/.
Problem

Research questions and friction points this paper is trying to address.

Limited real-world threat representation in X-ray baggage datasets
Closed-set paradigm constraints with predefined labels
Need for multimodal learning in security inspection tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal X-ray baggage security dataset STCray
Domain-aware visual AI assistant STING-BEE
Support for scene comprehension, referring threat localization, visual grounding, and VQA in security inspection
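The abstract notes that STCray's captioning protocol yields multimodal instruction-following data. A minimal sketch of what that conversion could look like is below; the record schema, field names, and prompt templates are hypothetical illustrations, not taken from the paper.

```python
# Hypothetical sketch: expand one image-caption record into
# (instruction, response) training samples for a vision-language assistant.
# All field names and templates are illustrative, not from STCray itself.

def build_instruction_samples(record):
    """Turn a captioned X-ray scan record into instruction-following pairs."""
    image, caption = record["image"], record["caption"]
    samples = [
        {
            "image": image,
            "instruction": "Describe the contents of this X-ray baggage scan.",
            "response": caption,
        }
    ]
    # One localization-style query per annotated threat in the record.
    for threat in record.get("threats", []):
        samples.append(
            {
                "image": image,
                "instruction": f"Is there a {threat['label']} in this scan? If so, where?",
                "response": f"Yes, a {threat['label']} is visible at {threat['bbox']}.",
            }
        )
    return samples

record = {
    "image": "scan_00042.png",
    "caption": "A backpack containing a laptop and a concealed folding knife.",
    "threats": [{"label": "knife", "bbox": [120, 85, 60, 40]}],
}
print(build_instruction_samples(record))
```

Each scan thus contributes one captioning sample plus one grounding-style sample per annotated threat, which is one plausible way caption data can seed open-ended VQA and localization training.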