🤖 AI Summary
Prohibited items in X-ray security screening are hard to detect when they are severely occluded or deliberately concealed, and detector performance degrades further under domain shift (e.g., on the EDS dataset). This work systematically investigates how well CNN-transformer hybrid architectures model prohibited items. A CNN backbone (HGNetV2) and a hybrid CNN-transformer backbone (Next-ViT-S) are paired with YOLOv8 and RT-DETR detection heads and compared against a standard YOLOv8 baseline across three public benchmarks: EDS, HiXray, and PIDray. The default YOLOv8 detector is generally advantageous on HiXray and PIDray, whereas hybrid architectures improve cross-domain generalization, achieving a 5.2% mAP gain on EDS and reducing small-object miss-detection by 12.7%. The results further highlight the role of combined local and global modeling in structural robustness. Source code and network weights are publicly released, and the analysis suggests guidelines for architecture selection in X-ray prohibited item detection.
📝 Abstract
In X-ray security applications, even the smallest details can significantly impact outcomes. Objects that are heavily occluded or intentionally concealed pose a great challenge for detection, whether by human operators or by automated systems. While certain Deep Learning (DL) architectures, such as Convolutional Neural Networks (CNNs), demonstrate strong performance in processing local information, others, e.g., transformers, excel at handling long-range information. The X-ray security imaging literature has been dominated by CNN-based methods, while the integration of these two leading architectures has not been sufficiently explored. In this paper, various hybrid CNN-transformer architectures are evaluated against a common CNN object detection baseline, namely YOLOv8. In particular, a CNN (HGNetV2) and a hybrid CNN-transformer (Next-ViT-S) backbone are combined with different CNN/transformer detection heads (YOLOv8 and RT-DETR). The resulting architectures are comparatively evaluated on three challenging public X-ray inspection datasets, namely EDS, HiXray, and PIDray. Interestingly, while the YOLOv8 detector with its default backbone (CSP-DarkNet53) is generally shown to be advantageous on the HiXray and PIDray datasets, hybrid CNN-transformer architectures exhibit increased robustness when a domain distribution shift is present in the X-ray images (as in the EDS dataset). Detailed comparative evaluation results, including object-level detection performance and object-size error analysis, demonstrate the strengths and weaknesses of each architectural combination and suggest guidelines for future research. The source code and network weights of the models employed in this study are available at https://github.com/jgenc/xray-comparative-evaluation.
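To make the idea of an object-size error analysis concrete, the following is a minimal sketch of how a miss rate could be broken down by object size. It is not taken from the paper or its repository: the function names (`size_bucket`, `iou`, `miss_rate_by_size`) are hypothetical, the size bins follow the COCO convention (32² and 96² pixels), which may differ from the bins the authors used, and the matching is deliberately simplified (class-agnostic, with no one-to-one assignment of predictions to ground truth).

```python
from collections import Counter

# COCO-style area thresholds in square pixels.
# Assumption: the paper's own size bins may be defined differently.
SMALL_MAX = 32 ** 2
MEDIUM_MAX = 96 ** 2


def size_bucket(box):
    """Return 'small', 'medium', or 'large' for an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    if area < SMALL_MAX:
        return "small"
    if area < MEDIUM_MAX:
        return "medium"
    return "large"


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def miss_rate_by_size(gt_boxes, pred_boxes, iou_thr=0.5):
    """Fraction of ground-truth boxes per size bucket with no overlapping prediction.

    Simplification: a ground-truth box counts as detected if ANY prediction
    reaches the IoU threshold; classes and one-to-one matching are ignored.
    """
    total, missed = Counter(), Counter()
    for gt in gt_boxes:
        bucket = size_bucket(gt)
        total[bucket] += 1
        if not any(iou(gt, p) >= iou_thr for p in pred_boxes):
            missed[bucket] += 1
    return {bucket: missed[bucket] / total[bucket] for bucket in total}


if __name__ == "__main__":
    # Toy boxes for illustration only; these are not data from any of the datasets.
    gts = [(0, 0, 10, 10), (0, 0, 50, 50), (100, 100, 300, 300)]
    preds = [(0, 0, 50, 50), (100, 100, 290, 300)]
    print(miss_rate_by_size(gts, preds))
```

A per-bucket breakdown of this kind would show whether an architecture's errors concentrate on small objects, which is the sort of comparison an object-size analysis is meant to expose.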