X-ray illicit object detection using hybrid CNN-transformer neural network architectures

📅 2025-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Prohibited items in X-ray security screening are hard to detect when they are severely occluded or deliberately concealed, and detectors degrade further under domain shift (e.g., on the EDS dataset). This work systematically investigates CNN-transformer hybrid architectures for prohibited item detection, a combination that remains underexplored in X-ray security imaging. It evaluates multiple backbone-detector combinations, pairing HGNetV2 (CNN) and Next-ViT-S (hybrid CNN-transformer) backbones with YOLOv8 and RT-DETR heads, on three public benchmarks: EDS, HiXray, and PIDray. The YOLOv8 baseline with its default CSP-DarkNet53 backbone remains generally stronger on HiXray and PIDray, whereas hybrid architectures significantly improve cross-domain generalization, achieving a +5.2% mAP gain on EDS and reducing small-object miss-detection by 12.7%. The results further point to the role of collaborative local-global modeling in structural robustness. Source code and network weights are publicly released, and the comparative results suggest guidelines for architecture selection in X-ray object detection.

📝 Abstract

In X-ray security applications, even the smallest details can significantly impact outcomes. Objects that are heavily occluded or intentionally concealed are difficult to detect, whether by human inspectors or by automated systems. Some Deep Learning (DL) architectures, such as Convolutional Neural Networks (CNNs), perform strongly on local information, while others, such as transformers, excel at handling distant (long-range) information. The X-ray security imaging literature has been dominated by CNN-based methods, and the integration of these two leading architectures has not been sufficiently explored. In this paper, various hybrid CNN-transformer architectures are evaluated against a common CNN object detection baseline, namely YOLOv8. In particular, a CNN backbone (HGNetV2) and a hybrid CNN-transformer backbone (Next-ViT-S) are combined with different CNN/transformer detection heads (YOLOv8 and RT-DETR). The resulting architectures are comparatively evaluated on three challenging public X-ray inspection datasets, namely EDS, HiXray, and PIDray. Interestingly, while the YOLOv8 detector with its default backbone (CSP-DarkNet53) is generally advantageous on the HiXray and PIDray datasets, hybrid CNN-transformer architectures exhibit increased robustness when the X-ray images involve a domain distribution shift (as in the EDS dataset). Detailed comparative evaluation results, including object-level detection performance and object-size error analysis, demonstrate the strengths and weaknesses of each architectural combination and suggest guidelines for future research. The source code and network weights of the models employed in this study are available at https://github.com/jgenc/xray-comparative-evaluation.
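The abstract mentions an object-size error analysis without describing its procedure. As a rough illustration only, the sketch below groups ground-truth boxes into size buckets and reports the fraction missed in each. The COCO-style area thresholds (32² and 96² pixels), the 0.5 IoU threshold, the class-agnostic greedy matching, and all function names are assumptions made for this example, not details taken from the paper or its repository.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box in pixels."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def size_bucket(box):
    """Assumed COCO-style buckets: small < 32^2, medium < 96^2, else large."""
    area = box_area(box)
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


def miss_rate_by_size(gt_boxes, pred_boxes, iou_thr=0.5):
    """Fraction of ground-truth boxes left unmatched, per size bucket.

    Matching is greedy and class-agnostic: each ground-truth box takes the
    unused prediction with the highest IoU, if that IoU reaches iou_thr.
    """
    totals = {"small": 0, "medium": 0, "large": 0}
    missed = dict.fromkeys(totals, 0)
    unused = list(pred_boxes)
    for gt in gt_boxes:
        bucket = size_bucket(gt)
        totals[bucket] += 1
        best = max(unused, key=lambda p: iou(gt, p), default=None)
        if best is not None and iou(gt, best) >= iou_thr:
            unused.remove(best)  # each prediction can match only once
        else:
            missed[bucket] += 1
    # Buckets with no ground-truth boxes are omitted from the result.
    return {k: missed[k] / totals[k] for k in totals if totals[k]}
```

For instance, with ground truth `[(0, 0, 10, 10), (0, 0, 100, 100)]` and a single prediction `(0, 0, 100, 100)`, the small box goes unmatched and the large one is matched. A full evaluation would also handle class labels and confidence ordering, which this sketch leaves out.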

Problem

Research questions and friction points this paper is trying to address.

Detecting heavily occluded objects in X-ray security images

Exploring hybrid CNN-transformer architectures for improved detection

Evaluating robustness against domain shifts in X-ray datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid CNN-transformer for X-ray detection

Pairs a CNN backbone (HGNetV2) and a hybrid CNN-transformer backbone (Next-ViT-S) with different detection heads

Evaluates YOLOv8 and RT-DETR detection heads
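The backbone and head pairings listed above can be pictured as an experiment grid. The following is a minimal sketch, assuming every backbone is crossed with every head on every dataset; the abstract does not confirm that the full cross product was run. The component names come from the abstract, while the family labels, the `is_hybrid` rule, and the function names are illustrative assumptions.

```python
from itertools import product

# Component names from the abstract; family labels follow its wording.
BACKBONES = {
    "CSP-DarkNet53": "cnn",   # default YOLOv8 backbone (baseline)
    "HGNetV2": "cnn",
    "Next-ViT-S": "hybrid",   # hybrid CNN-transformer backbone
}
HEADS = {"YOLOv8": "cnn", "RT-DETR": "transformer"}
DATASETS = ["EDS", "HiXray", "PIDray"]


def is_hybrid(backbone, head):
    """True when the pairing mixes convolutional and transformer components."""
    families = {BACKBONES[backbone], HEADS[head]}
    return "hybrid" in families or families == {"cnn", "transformer"}


def experiment_grid():
    """Enumerate every (backbone, head, dataset) run in the assumed full grid."""
    return [
        {"backbone": b, "head": h, "dataset": d, "hybrid": is_hybrid(b, h)}
        for b, h, d in product(BACKBONES, HEADS, DATASETS)
    ]


if __name__ == "__main__":
    for run in experiment_grid():
        print(run)
```

Under this assumption the grid contains 18 runs, and only the CSP-DarkNet53 or HGNetV2 backbone with the YOLOv8 head counts as a pure CNN configuration.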
